Re: Large block sizes support in Linux

From: Pankaj Raghav <kernel(at)pankajraghav(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, p(dot)raghav(at)samsung(dot)com, mcgrof(at)kernel(dot)org, gost(dot)dev(at)samsung(dot)com
Subject: Re: Large block sizes support in Linux
Date: 2024-03-25 14:34:07
Message-ID: 527bb89f-18a3-4551-accb-d4d4f97c2151@pankajraghav.com
Lists: pgsql-hackers

Hi Thomas,

On 23/03/2024 05:53, Thomas Munro wrote:
> On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
> <kernel(at)pankajraghav(dot)com> wrote:
>> My team and I have been working on adding Large block size(LBS)
>> support to XFS in Linux[1]. Once this feature lands upstream, we will be
>> able to create XFS with FS block size > page size of the system on Linux.
>> We also gave a talk about it in Linux Plumbers conference recently[2]
>> for more context. The initial support is only for XFS but more FSs will
>> follow later.
>
> Very cool!
>
> (I used XFS on IRIX in the 90s, and it had large blocks then, a
> feature lost in the port to Linux AFAIK.)
>

Yes, I also heard from the XFS maintainer that they had to drop
this functionality when they did the port. :)

>> On an x86_64 system, fs block size was limited to 4k, but traditionally
>> Postgres uses 8k as its default internal page size. With LBS support,
>> fs block size can be set to 8K, thereby matching the Postgres page size.
>>
>> If the file system block size == DB page size, then Postgres can have
>> guarantees that a single DB page will be written as a single unit during
>> kernel write back and not split.
>>
>> My knowledge of Postgres internals is limited, so I'm wondering if there
>> are any optimizations or potential optimizations that Postgres could
>> leverage once we have LBS support on Linux?
>
> FWIW here are a couple of things I wrote about our storage atomicity
> problem, for non-PostgreSQL hackers who may not understand our project
> jargon:
>
> https://wiki.postgresql.org/wiki/Full_page_writes
> https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf
>
This is very useful, thanks a lot.

> The short version is that we (and MySQL, via a different scheme with
> different tradeoffs) could avoid writing all our stuff out twice if we
> could count on atomic writes of a suitable size on power failure, so
> the benefits are very large. As far as I know, there are two things
> we need from the kernel and storage to do that on "overwrite"
> filesystems like XFS:
>
> 1. The disk must promise that its atomicity-on-power-failure is a
> multiple of our block size -- something like NVMe AWUPF, right? My
> devices seem to say 0 :-( Or I guess the filesystem has to
> compensate, but then it's not exactly an overwrite filesystem
> anymore...
>

0 means 1 logical block (the field is 0's based), which might be 4k in
your case. Typically, device vendors have to add extra hardware to
guarantee larger atomic write sizes.
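
To make the 0's based convention concrete, here is a minimal sketch of the
arithmetic. The function names are hypothetical, and the check is a
simplification: it only compares sizes and ignores write alignment and any
per-namespace overrides the device may report.

```python
def atomic_write_bytes(awupf: int, lba_size: int) -> int:
    """Convert a 0's based AWUPF value into bytes.

    AWUPF counts logical blocks and is 0's based, so a reported
    value of 0 means one logical block.
    """
    return (awupf + 1) * lba_size


def covers_page(awupf: int, lba_size: int, page_size: int = 8192) -> bool:
    """Return True if the power-fail atomic write unit is at least one DB page.

    page_size defaults to 8192, the Postgres default mentioned in this thread.
    """
    return atomic_write_bytes(awupf, lba_size) >= page_size


if __name__ == "__main__":
    # AWUPF = 0 with 4k logical blocks: 4096 bytes, smaller than an 8k page.
    print(atomic_write_bytes(0, 4096), covers_page(0, 4096))
    # AWUPF = 1 with 4k logical blocks: 8192 bytes, enough for an 8k page.
    print(atomic_write_bytes(1, 4096), covers_page(1, 4096))
```

So a device reporting 0 with 4k logical blocks can only promise 4k on power
failure, which is not enough for a default 8k Postgres page.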

> 2. The kernel must promise that there is no code path in either
> buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or
> other configured block size) writes on some smaller boundary, most
> likely sector I guess, on their way to the device, as you were saying.
> Not just in happy cases, but even under memory pressure, if
> interrupted, etc etc.
>
> Sounds like you're working on problem #2 which is great news.
>

Yes, you are spot on. :)
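
For anyone who wants to check the "fs block size == DB page size" condition
on a given mount, here is a rough sketch. The helper names are hypothetical,
the 8192 default is the Postgres default page size mentioned above, and I am
assuming that f_bsize from statvfs reflects the FS block size on the
filesystem in question.

```python
import os

DB_PAGE_SIZE = 8192  # Postgres default page size, as discussed in this thread


def fs_block_size(path: str) -> int:
    """Return the block size reported by statvfs for the filesystem at path."""
    return os.statvfs(path).f_bsize


def page_matches_fs_block(block_size: int, page_size: int = DB_PAGE_SIZE) -> bool:
    """Return True when the FS block size equals the DB page size."""
    return block_size == page_size


if __name__ == "__main__":
    # "." is only a placeholder; point this at the data directory's mount.
    size = fs_block_size(".")
    print(f"fs block size: {size}, matches DB page: {page_matches_fs_block(size)}")
```

A match only says the sizes line up. It says nothing about what the device
itself guarantees on power failure, which is problem #1.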

> I've been wondering for a while how a Unixoid kernel should report
> these properties to userspace where it knows them, especially on
> non-overwrite filesystems like ZFS where this sort of thing works

So it looks like ZFS (or any other CoW filesystem that supports larger
block sizes) already does what Postgres does anyway with FPW=on, which
makes it safe to turn off FPW.

One question: does ZFS issue something like a FUA request to force the
data out of the device cache and onto stable media before it updates the
node to point to the new page?

If it doesn't, the device gives no guarantee that the data is updated
atomically, unless it offers larger atomic write guarantees?

> already, without stuff like AWUPF working the way one might hope.
> Here was one throw-away idea on the back of a napkin about that, for
> what little it's worth:
> https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

As I replied in the previous mail to Tomas, we might give a talk about
untorn writes[1] at LSFMM this year. I hope to bring up some of the
points discussed here. Thanks!

[1] https://lore.kernel.org/linux-fsdevel/20240228061257(dot)GA106651(at)mit(dot)edu/
