Re: AIO writes vs hint bits vs checksums

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: AIO writes vs hint bits vs checksums
Date: 2024-09-26 21:56:34
Message-ID: CA+hUKGJsndPVmEOcgWeKnZit-u6pOWnGaq0pACXOQfn79sfDwA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 25, 2024 at 12:45 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > However, our habit of modifying buffers while IO is going on is
> > causing issues with filesystem level checksums as well, as evidenced by the
> > fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
> > tend to think it'd be better to just stop doing that altogether (we also do
> > that for WAL, when writing out a partial page, but a potential fix there would
> > be different, I think).
>
> +many. Interesting point re the WAL variant. For the record, here's
> some discussion and a repro for that problem, which Andrew currently
> works around in a build farm animal with mount options:
>
> https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com

Here's an interesting new development in that area, this time from
OpenZFS, which committed its long-awaited O_DIRECT support a couple of
weeks ago[1] and seems to have taken a different direction since that
last discussion. Clearly it has the same checksum stability problem
as btrfs and PostgreSQL itself, so an O_DIRECT mode with the goal of
avoiding copying and caching must confront that and break *something*,
or accept something like bounce buffers and give up the zero-copy
goal. Curiously, they seem to have landed on two different solutions
with three different possible behaviours: (1) on FreeBSD, temporarily
make the memory non-writeable; (2) on Linux, where they couldn't do
that, add an extra checksum verification on write. I haven't fully
grokked all this yet, or even tried it, and it's not released or
anything, but it looks a bit like all three behaviours are bad for our
current hint bit design: on FreeBSD, setting a hint bit might crash
(?) if a write is in progress in another process, and on Linux,
depending on zfs_vdev_direct_write_verify, either the concurrent write
might fail (= checkpointer failing on EIO because someone concurrently
set a hint bit) or a later read might fail (= file is permanently
corrupted and you don't find out until later, like btrfs). I plan to
look more closely soon and see if I understood that right...
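
To make the checksum stability problem concrete, here is a toy model in
Python. It is only a sketch under stated assumptions, not ZFS or
PostgreSQL code: the function names, the page size, the hint byte offset
and the use of CRC-32 are all made up for illustration. It contrasts a
zero-copy write, where the checksum is computed over memory that another
process can still modify, with a bounce-buffer write, where a private
copy is taken first.

```python
import zlib

PAGE_SIZE = 8192   # assumed page size, for illustration only
HINT_BYTE = 100    # hypothetical offset of a byte holding a hint bit


def set_hint_bit(page: bytearray) -> None:
    """Stand-in for another process setting a hint bit during the write."""
    page[HINT_BYTE] |= 0x01


def write_zero_copy(page: bytearray, concurrent_modifier=None):
    """Checksum the caller's memory, then transfer it later without copying."""
    checksum = zlib.crc32(page)          # checksum taken from shared memory
    if concurrent_modifier is not None:
        concurrent_modifier(page)        # page changes while the IO is in flight
    stored = bytes(page)                 # what the device ends up reading
    return stored, checksum


def write_bounce_buffer(page: bytearray, concurrent_modifier=None):
    """Copy to a private buffer first, giving up the zero-copy goal."""
    private = bytes(page)                # stable snapshot of the page
    checksum = zlib.crc32(private)
    if concurrent_modifier is not None:
        concurrent_modifier(page)        # only the shared copy changes
    return private, checksum


def verify(stored: bytes, checksum: int) -> bool:
    """What a later read, or a verify-on-write step, would check."""
    return zlib.crc32(stored) == checksum


if __name__ == "__main__":
    stored, checksum = write_zero_copy(bytearray(PAGE_SIZE), set_hint_bit)
    print("zero-copy verifies:", verify(stored, checksum))

    stored, checksum = write_bounce_buffer(bytearray(PAGE_SIZE), set_hint_bit)
    print("bounce buffer verifies:", verify(stored, checksum))
```

In this model the zero-copy path ends up with data that no longer
matches its checksum, which is the situation a verify-on-write step
would report as an error and a later read would report as corruption.
The bounce-buffer path stays consistent, at the cost of the extra copy.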

[1] https://github.com/openzfs/zfs/pull/10018/commits/d7b861e7cfaea867ae28ab46ab11fba89a5a1fda
