From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: "Pankaj Raghav (Samsung)" <kernel(at)pankajraghav(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, p(dot)raghav(at)samsung(dot)com, mcgrof(at)kernel(dot)org, gost(dot)dev(at)samsung(dot)com
Subject: Re: Large block sizes support in Linux
Date: 2024-03-23 04:53:20
Message-ID: CA+hUKGKfAy4ypKO9dEsvP5b-gLGrzxCa9G5cd3AmMx0jes5vgg@mail.gmail.com
Lists: pgsql-hackers
On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
<kernel(at)pankajraghav(dot)com> wrote:
> My team and I have been working on adding Large block size (LBS)
> support to XFS in Linux[1]. Once this feature lands upstream, we will be
> able to create XFS with FS block size > page size of the system on Linux.
> We also gave a talk about it in Linux Plumbers conference recently[2]
> for more context. The initial support is only for XFS but more FSs will
> follow later.
Very cool!
(I used XFS on IRIX in the 90s, and it had large blocks then, a
feature lost in the port to Linux AFAIK.)
> On an x86_64 system, fs block size was limited to 4k, but traditionally
> Postgres uses 8k as its default internal page size. With LBS support,
> fs block size can be set to 8K, thereby matching the Postgres page size.
>
> If the file system block size == DB page size, then Postgres can have
> guarantees that a single DB page will be written as a single unit during
> kernel write back and not split.
>
> My knowledge of Postgres internals is limited, so I'm wondering if there
> are any optimizations or potential optimizations that Postgres could
> leverage once we have LBS support on Linux?
FWIW here are a couple of things I wrote about our storage atomicity
problem, for non-PostgreSQL hackers who may not understand our project
jargon:
https://wiki.postgresql.org/wiki/Full_page_writes
https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf
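(For the configuration-minded: the double write described on that wiki page is controlled by the full_page_writes setting, which the PostgreSQL documentation says is only safe to disable when the I/O subsystem itself guarantees that partial page writes cannot happen -- ZFS with a matching record size being the usual example.)

```
# postgresql.conf -- only safe when the storage stack guarantees
# that 8KB page writes are atomic across power failure
full_page_writes = off
```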
The short version is that we (and MySQL, via a different scheme with
different tradeoffs) could avoid writing all our stuff out twice if we
could count on atomic writes of a suitable size on power failure, so
the benefits are very large. As far as I know, there are two things
we need from the kernel and storage to do that on "overwrite"
filesystems like XFS:
1. The disk must promise that its atomicity-on-power-failure is a
multiple of our block size -- something like NVMe AWUPF, right? My
devices seem to say 0 :-( Or I guess the filesystem has to
compensate, but then it's not exactly an overwrite filesystem
anymore...
2. The kernel must promise that there is no code path in either
buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or
other configured block size) writes on some smaller boundary, most
likely sector I guess, on their way to the device, as you were saying.
Not just in happy cases, but even under memory pressure, if
interrupted, etc etc.
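On the AWUPF point in #1, one concrete way to see what a device claims is the nvme-cli tool (the /dev/nvme0 path here is a placeholder for your controller):

```shell
# AWUPF (Atomic Write Unit Power Fail) is 0-based and counted in
# logical blocks: a reported value of 0 means one logical block --
# usually 512B or 4KB, so not enough to cover an 8KB page atomically.
nvme id-ctrl /dev/nvme0 | grep -i awupf
```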
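The failure mode these two guarantees rule out can be sketched as a toy model (plain Python, not PostgreSQL code -- the checksum scheme is invented for the sketch; PostgreSQL's page checksums work differently): an 8KB page reaches the device as two 4KB pieces, power is lost in between, and only a page checksum reveals the damage afterwards.

```python
import hashlib

PAGE_SIZE = 8192  # PostgreSQL's default block size
SECTOR = 4096     # hypothetical boundary the I/O stack splits writes on

def make_page(fill: bytes) -> bytes:
    """Build a page whose first 32 bytes checksum the remaining body."""
    body = (fill * (PAGE_SIZE // len(fill)))[: PAGE_SIZE - 32]
    return hashlib.sha256(body).digest() + body

def is_torn(page: bytes) -> bool:
    """True if the stored checksum no longer matches the body."""
    return hashlib.sha256(page[32:]).digest() != page[:32]

old = make_page(b"old!")  # page image already on disk
new = make_page(b"new!")  # page image being written

# Power fails between the two 4KB sub-writes: the first half of the
# new image lands, the second half still holds the old image's bytes.
torn = new[:SECTOR] + old[SECTOR:]

assert not is_torn(new)
assert is_torn(torn)  # detected -- but without a full-page image in
                      # the WAL, the page cannot be reconstructed
```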
Sounds like you're working on problem #2, which is great news.
I've been wondering for a while how a Unixoid kernel should report
these properties to userspace where it knows them, especially on
non-overwrite filesystems like ZFS where this sort of thing works
already, without stuff like AWUPF working the way one might hope.
Here was one throw-away idea on the back of a napkin about that, for
what little it's worth: