From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Noah Misch <noah(at)leadboat(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Direct I/O |
Date: | 2023-04-09 01:55:33 |
Message-ID: | CA+hUKG+Hw8R-KtNMAX+-CyuBYp5N1MFo7mcXZuor=yQoMBDTow@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Apr 9, 2023 at 11:05 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Googling finds a lot of suggestions that O_DIRECT doesn't play nice
> with btrfs, for example
>
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html
>
> It's not clear to me how much of that lore is still current,
> but it's disturbing.
I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums). I don't think we should ever do that ourselves (not least
because it'd break our own checksums). We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.
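To make that failure mode concrete, here is a minimal toy model in Python. It is not btrfs or PostgreSQL code: `direct_write`, `verify`, and the `mutate_midway` flag are hypothetical names, and `zlib.crc32` is only a stand-in for whatever checksum a file system uses. It assumes the checksum is taken from the caller's buffer before the data is transferred, so a modification in between leaves the stored checksum and stored data out of sync.

```python
import zlib


def direct_write(user_buf: bytearray, mutate_midway: bool = False):
    """Toy model of a checksumming file system servicing a direct write."""
    # Step 1: the checksum is computed from the caller's buffer in place
    # (with direct I/O there is no private kernel copy to checksum).
    stored_checksum = zlib.crc32(user_buf)

    # Step 2: simulate someone scribbling on the buffer mid-write.
    if mutate_midway:
        user_buf[0] ^= 0xFF

    # Step 3: the device transfers whatever the buffer holds now.
    stored_data = bytes(user_buf)
    return stored_data, stored_checksum


def verify(stored_data: bytes, stored_checksum: int) -> bool:
    """Re-read check: does the data still match its recorded checksum?"""
    return zlib.crc32(stored_data) == stored_checksum


if __name__ == "__main__":
    page = bytearray(b"x" * 8192)
    print(verify(*direct_write(page)))                      # True
    print(verify(*direct_write(page, mutate_midway=True)))  # False
```

Holding the page locked for the duration of the write corresponds to never taking the `mutate_midway` path in this model.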
Here's something recent. I guess it's probably not relevant (a fault
on our buffer that we recently touched sounds pretty unlikely), but
who knows... (developer lists for file systems are truly terrifying
places to drive through).
https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/
It's odd, though, if it is their bug and not ours: I'd expect our
friends in other databases to have hit all that sort of thing years
ago, since many comparable systems have a direct I/O knob*. What are
we doing differently? Are our multiple processes a factor here,
breaking some coherency logic? Unsurprisingly, having compression on,
as Andrew does, actually involves buffering anyway[1] despite our
O_DIRECT flag, but maybe that means writes are buffered while reads
are still direct (?), which sounds like the sort of initial conditions
that might produce a coherency bug. I dunno.
I gather that btrfs is actually Fedora's default file system (or maybe
it's just "laptops and desktops"[2]?). I wonder if any of the several
green Fedora systems in the 'farm are using btrfs. I wonder if they
are using different mount options (thinking again of compression).
*Probably a good reason to add a more prominent warning that the
feature is developer-only, experimental and not for production use.
I'm thinking a warning at startup or something.
[1] https://btrfs.readthedocs.io/en/latest/Compression.html
[2] https://fedoraproject.org/wiki/Changes/BtrfsByDefault