Re: Large block sizes support in Linux

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Pankaj Raghav <kernel(at)pankajraghav(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, p(dot)raghav(at)samsung(dot)com, mcgrof(at)kernel(dot)org, gost(dot)dev(at)samsung(dot)com
Subject: Re: Large block sizes support in Linux
Date: 2024-03-27 22:13:00
Message-ID: ZgSZ7JYL+Yqg/jiC@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Pankaj Raghav (kernel(at)pankajraghav(dot)com) wrote:
> On 23/03/2024 05:53, Thomas Munro wrote:
> > On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
> > <kernel(at)pankajraghav(dot)com> wrote:
> >> My team and I have been working on adding Large block size(LBS)
> >> support to XFS in Linux[1]. Once this feature lands upstream, we will be
> >> able to create XFS with FS block size > page size of the system on Linux.
> >> We also gave a talk about it in Linux Plumbers conference recently[2]
> >> for more context. The initial support is only for XFS but more FSs will
> >> follow later.
> >
> > Very cool!

Yes, this sounds very cool and could make a real difference for PG.

> > (I used XFS on IRIX in the 90s, and it had large blocks then, a
> > feature lost in the port to Linux AFAIK.)
>
> Yes, I heard this also from the Maintainer of XFS that they had to drop
> this functionality when they did the port. :)

I also recall the days of XFS on IRIX... Many moons ago.

> > The short version is that we (and MySQL, via a different scheme with
> > different tradeoffs) could avoid writing all our stuff out twice if we
> > could count on atomic writes of a suitable size on power failure, so
> > the benefits are very large. As far as I know, there are two things
> > we need from the kernel and storage to do that on "overwrite"
> > filesystems like XFS:
> >
> > 1. The disk must promise that its atomicity-on-power-failure is a
> > multiple of our block size -- something like NVMe AWUPF, right? My
> > devices seem to say 0 :-( Or I guess the filesystem has to
> > compensate, but then it's not exactly an overwrite filesystem
> > anymore...
>
> 0 means 1 logical block, which might be 4k in your case. Typically device
> vendors have to put extra hardware to guarantee bigger atomic block sizes.

If I'm following correctly, this would mean that PG with FPW=off
(assuming everything else works) would be safe on more systems if PG
supported a 4K block size than if PG only supports 8K blocks, right?
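
A minimal sketch of that arithmetic, covering only the device side of the
question. It assumes AWUPF is reported as a 0's based count of logical blocks
(so 0 means one block, as noted above) and that the atomic unit has to be a
multiple of the PG block size. The function names and the example sizes are
hypothetical, chosen for illustration; filesystem and kernel behaviour are
not modelled.

```python
def atomic_write_bytes(awupf: int, lba_bytes: int) -> int:
    """Bytes the device promises to write atomically across a power failure.

    AWUPF is a 0's based value counted in logical blocks, so a reported 0
    means one logical block.
    """
    return (awupf + 1) * lba_bytes


def torn_page_safe(pg_block_bytes: int, awupf: int, lba_bytes: int) -> bool:
    """True if one PG block fits within the device's power-fail atomic unit.

    Assumption: the atomic unit must be a whole multiple of the PG block
    size. This says nothing about what the filesystem or kernel guarantees.
    """
    atomic = atomic_write_bytes(awupf, lba_bytes)
    return atomic >= pg_block_bytes and atomic % pg_block_bytes == 0


if __name__ == "__main__":
    # Hypothetical device: AWUPF = 0 with 4K logical blocks.
    for block in (4096, 8192):
        print(block, torn_page_safe(block, awupf=0, lba_bytes=4096))
```

On such a device a 4K PG block fits within the atomic unit while an 8K block
could still be torn, which is the case where full-page writes are still
needed.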

There's been discussion and even some patches posted around the idea of
having run-time support in PG for different block sizes. Currently,
it's a compile-time option with the default being 8K, meaning that's the
only option on a huge number of the deployed PG environments out there.
Moving it to run-time has some challenges and there are concerns about the
performance ... but if it meant we could run safely with FPW=off, that's
a pretty big deal. On the other hand, if the expectation is that
basically everything will support atomic 8K, then we might be able to
simply keep that and not deal with supporting different page sizes at
run-time (of course, this is only one of the considerations in play, but
it could be particularly key, if I'm following correctly).

Appreciate any insights you can share on this.

Thanks!

Stephen
