Re: Large block sizes support in Linux

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: "Pankaj Raghav (Samsung)" <kernel(at)pankajraghav(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, p(dot)raghav(at)samsung(dot)com, mcgrof(at)kernel(dot)org, gost(dot)dev(at)samsung(dot)com
Subject: Re: Large block sizes support in Linux
Date: 2024-03-23 04:53:20
Message-ID: CA+hUKGKfAy4ypKO9dEsvP5b-gLGrzxCa9G5cd3AmMx0jes5vgg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
<kernel(at)pankajraghav(dot)com> wrote:
> My team and I have been working on adding Large block size (LBS)
> support to XFS in Linux[1]. Once this feature lands upstream, we will be
> able to create XFS with FS block size > page size of the system on Linux.
> We also gave a talk about it in Linux Plumbers conference recently[2]
> for more context. The initial support is only for XFS but more FSs will
> follow later.

Very cool!

(I used XFS on IRIX in the 90s, and it had large blocks then, a
feature lost in the port to Linux AFAIK.)

> On an x86_64 system, fs block size was limited to 4k, but traditionally
> Postgres uses 8k as its default internal page size. With LBS support,
> fs block size can be set to 8K, thereby matching the Postgres page size.
>
> If the file system block size == DB page size, then Postgres can have
> guarantees that a single DB page will be written as a single unit during
> kernel write back and not split.
>
> My knowledge of Postgres internals is limited, so I'm wondering if there
> are any optimizations or potential optimizations that Postgres could
> leverage once we have LBS support on Linux?

FWIW here are a couple of things I wrote about our storage atomicity
problem, for non-PostgreSQL hackers who may not understand our project
jargon:

https://wiki.postgresql.org/wiki/Full_page_writes
https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf
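For context, the defence described on that wiki page is controlled by the
full_page_writes setting (on by default); with trustworthy atomic 8KB
writes it could be switched off. A sketch, assuming a running server and
psql on the path:

```shell
# Show whether full-page images are written to WAL for the first change
# to each page after a checkpoint (the "writing twice" cost below).
psql -c "SHOW full_page_writes;"
```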

The short version is that we (and MySQL, via a different scheme with
different tradeoffs) could avoid writing all our stuff out twice if we
could count on atomic writes of a suitable size on power failure, so
the benefits are very large. As far as I know, there are two things
we need from the kernel and storage to do that on "overwrite"
filesystems like XFS:

1. The disk must promise that its atomicity-on-power-failure is a
multiple of our block size -- something like NVMe AWUPF, right? My
devices seem to say 0 :-( Or I guess the filesystem has to
compensate, but then it's not exactly an overwrite filesystem
anymore...
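(For anyone following along, those device guarantees can be inspected
with nvme-cli; a sketch assuming a controller at /dev/nvme0 and a
namespace at /dev/nvme0n1. Note the fields are 0-based counts of
logical blocks, so 0 means only a single LBA is atomic.)

```shell
# Controller-level Atomic Write Unit Power Fail: "awupf : 0" means one
# logical block (512 B or 4 KB), nowhere near an 8 KB database page.
nvme id-ctrl /dev/nvme0 | grep -i awupf

# Namespace-level value, which overrides AWUPF when nonzero:
nvme id-ns /dev/nvme0n1 | grep -i nawupf
```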

2. The kernel must promise that there is no code path in either
buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or
other configured block size) writes on some smaller boundary, most
likely sector I guess, on their way to the device, as you were saying.
Not just in happy cases, but even under memory pressure, if
interrupted, etc etc.
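One way to poke at #2 from userspace, for what little it's worth: submit
a single 8KB write with and without O_DIRECT and watch what actually
reaches the device with a tracer such as blktrace. A minimal sketch,
assuming a filesystem that supports O_DIRECT (tmpfs does not):

```shell
# One 8KB write submitted with O_DIRECT: the page cache can't merge or
# split it before it is queued, though the block layer still might.
dd if=/dev/zero of=testfile bs=8192 count=1 oflag=direct conv=fsync

# The buffered equivalent goes through kernel writeback, which today
# makes no promise about preserving sub-8KB boundaries.
dd if=/dev/zero of=testfile bs=8192 count=1 conv=fsync
```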

Sounds like you're working on problem #2 which is great news.

I've been wondering for a while how a Unixoid kernel should report
these properties to userspace where it knows them, especially on
non-overwrite filesystems like ZFS where this sort of thing works
already, without stuff like AWUPF working the way one might hope.
Here was one throw-away idea on the back of a napkin about that, for
what little it's worth:

https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
