Writing out WAL buffers that are still in flux

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Writing out WAL buffers that are still in flux
Date: 2024-11-13 06:46:08
Message-ID: CA+hUKGL8J_mrHyE7Z=8VfkrDgBvHHREnpTfBxDX9NhqBW+woRA@mail.gmail.com
Lists: pgsql-hackers

Hi,

As we learned in the build farm[1], at least one checksumming
filesystem doesn't like it if you write data from memory that is still
changing, when direct I/O is enabled. That was BTRFS, but the next
version of OpenZFS seems to have a similar allergy. In Andres's
thread about fixing that (among other problems) for hint bits, he
pointed out that WAL must have the same problem[2]. I am splitting
that case off to a new thread given that the problem and potential
solutions seem pretty independent.

Concretely, on BTRFS you could write WAL apparently successfully, but
later if you tried to read it you'd get EIO because the checksum was
computed badly over moving data. On ZFS 2.3 (out soon) I think a
backend would crash/panic during the write, so different but also bad.
Some other obvious candidates in that general COW + integrity family
would be (1) APFS but no, it doesn't seem to have user data checksums,
(2) ReFS, which has user data checksums but doesn't enable them by
default and I have no idea what it would do in the same circumstances
(but Windows already explicitly tells us never to touch the buffer
during an I/O, so we're already breaking the rules regardless), (3)
bcachefs, which has user data checksums by default but it looks like
it has its own bounce buffer to defend against this type of
problem[3]. (Anyone got clues about other exotic systems?)

Here's an early experimental patch to try to do something about that,
by making a temporary copy of the moving trailing buffer. It's not
free, but the behaviour is only activated when you turn on
debug_io_direct=wal. No significant change to default behaviour
hopefully, but direct I/O WAL writes look like this:

postgres=# insert into t values (42);
pwrite(14,"\^X\M-Q\^E\0\^A\0\0\0\0\M-`\M^P"...,8192,0x90e000) = 8192

postgres=# insert into t select generate_series(1, 1000);
pwritev(14,[{"\^X\M-Q\^E\0\^A\0\0\0\0`\M^O\r\0"...,57344},
{"\^X\M-Q\^E\0\^A\0\0\0\0@\M^P\r\0"...,8192}],2,0x8f6000) = 65536

postgres=# begin;
postgres=*# insert into t select generate_series(1, 10000000);
[walwriter]
pwrite(5,"\^X\M-Q\^E\0\^A\0\0\0\0@`\^Z\0\0"...,2080768,0x604000) = 2080768

In the first case it had to take a copy as it was only writing the
tail, but otherwise the write looks the same as unpatched. In the second case, it
wrote as much as it could safely directly from WAL buffers, but then
had to write the tail page out with a temporary copy. In the third
case, bulk writing activity doesn't do partial pages so gets the
traditional behaviour.
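
To make the three cases above concrete, here is a minimal sketch of the general idea. It is not the attached patch: the function name, the SKETCH_BLCKSZ constant and the caller-supplied scratch buffer are all hypothetical, and it assumes the caller already knows whether the last page is still being filled. Stable pages are written straight from the WAL buffers, and only a still-changing tail page is copied first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define SKETCH_BLCKSZ 8192		/* hypothetical stand-in for XLOG_BLCKSZ */

/*
 * Build the iovec list for writing 'npages' WAL pages starting at 'pages'.
 *
 * If the last page is still in flux (other backends may be inserting into
 * it), copy it into 'scratch' and point the final iovec at the copy, so the
 * kernel never sees memory that changes during the write.  'scratch' must be
 * at least one page long and, for direct I/O, suitably aligned (assumption:
 * the caller takes care of that).
 *
 * Returns the number of iovecs filled in (1 or 2).
 */
static int
sketch_build_wal_iov(char *pages, size_t npages, int tail_in_flux,
					 char *scratch, struct iovec iov[2])
{
	size_t		stable;
	int			n = 0;

	assert(npages > 0);

	/* Pages before a moving tail are complete and can be written in place. */
	stable = tail_in_flux ? npages - 1 : npages;

	if (stable > 0)
	{
		iov[n].iov_base = pages;
		iov[n].iov_len = stable * SKETCH_BLCKSZ;
		n++;
	}

	if (tail_in_flux)
	{
		/* Take a private snapshot of the moving page. */
		memcpy(scratch, pages + stable * SKETCH_BLCKSZ, SKETCH_BLCKSZ);
		iov[n].iov_base = scratch;
		iov[n].iov_len = SKETCH_BLCKSZ;
		n++;
	}

	return n;
}
```

With eight pages and a moving tail this yields two iovecs of 57344 and 8192 bytes, the same shape as the pwritev() call in the second trace; a single moving page yields one iovec pointing at the copy, and a run of complete pages yields one iovec pointing directly at the WAL buffers.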

It might also be an idea to copy only the part we really care
about and zero the rest, and in either case perhaps only out to the
nearest PG_IO_ALIGN_SIZE (step size at which it is OK to end a direct
I/O write), though that last bit might not be worth the hassle if we
plan to just use smaller blocks anyway.
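
A rough sketch of that variation, under stated assumptions: the names are hypothetical, SKETCH_IO_ALIGN_SIZE merely stands in for PG_IO_ALIGN_SIZE, and 'used' is assumed to be the number of bytes of the tail page that are already finalized. It copies only those bytes, zero-fills up to the next alignment boundary, and reports how many bytes would need to be written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLCKSZ		 8192	/* hypothetical stand-in for XLOG_BLCKSZ */
#define SKETCH_IO_ALIGN_SIZE 4096	/* hypothetical stand-in for PG_IO_ALIGN_SIZE */

/*
 * Copy the first 'used' bytes of a partially filled WAL page into 'scratch',
 * zero the remainder up to the next multiple of SKETCH_IO_ALIGN_SIZE, and
 * return that rounded-up length, i.e. the size of the write to issue.
 */
static size_t
sketch_copy_tail(const char *page, size_t used, char *scratch)
{
	size_t		len;

	assert(used > 0 && used <= SKETCH_BLCKSZ);

	/* Round up to the alignment step (which must be a power of two). */
	len = (used + SKETCH_IO_ALIGN_SIZE - 1) &
		~((size_t) SKETCH_IO_ALIGN_SIZE - 1);

	memcpy(scratch, page, used);
	memset(scratch + used, 0, len - used);

	return len;
}
```

Whether the shorter write is worth the extra bookkeeping is exactly the open question raised above; with 4K WAL blocks the rounded length and the block size would coincide.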

Interaction with other potential future changes: Obviously AIO might
require something more... bouncy than a stack buffer. 4K WAL buffers
would reduce the size of the copied region (along with other
benefits[4]). A no-overwrite WAL buffer mode (to match the behaviour
of other databases) would avoid the cost completely (along with other
benefits).

What else could we do, if not something like this?

[1] https://www.postgresql.org/message-id/CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com
[2] https://www.postgresql.org/message-id/jo5p5nthb3hxwfj7qifiu2rxk5y3hexbwxs5d6x2jotsvj3bq5%40jhtrriubncjb
[3] https://bcachefs-docs.readthedocs.io/en/latest/feat-checksumming.html
[4] https://www.postgresql.org/message-id/flat/20231009230805.funj5ipoggjyzjz6%40awork3.anarazel.de

Attachment Content-Type Size
0001-Don-t-write-changing-WAL-buffers-with-direct-I-O.patch text/x-patch 3.7 KB
