Re: Lowering the default wal_blocksize to 4K

From:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject:	Re: Lowering the default wal_blocksize to 4K
Date:	2023-10-11 14:09:21
Message-ID:	CAEze2WhMPhcnA3Py+DaKGr9_jepK8=B=pXx=ruHFWGtrGh8LCw@mail.gmail.com
Lists:	pgsql-hackers

On Wed, 11 Oct 2023 at 01:29, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:
> > On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > On 2023-10-09 23:16:30 -0400, Tom Lane wrote:
> > >> Andres Freund <andres(at)anarazel(dot)de> writes:
> > >>> There's an alternative approach we could take, which is to write in 4KB
> > >>> increments, while keeping 8KB pages. With the current format that's not
> > >>> obviously a bad idea. But given there aren't really advantages in 8KB WAL
> > >>> pages, it seems we should just go for 4KB?
> > >>
> > >> Seems like that's doubling the overhead of WAL page headers. Do we need
> > >> to try to skinny those down?
> > >
> > > I think the overhead is small, and we are wasting so much space in other
> > > places, that I am not worried about the proportional increase page header
> > > space usage at this point, particularly compared to saving in overall write
> > > rate and increase in TPS. There's other areas we can save much more space, if
> > > we want to focus on that.
> > >
> > > I was thinking we should perhaps do the opposite, namely getting rid of short
> > > page headers. The overhead in the "byte position" <-> LSN conversion due to
> > > the differing space is worse than the gain. Or do something inbetween - having
> > > the system ID in the header adds a useful crosscheck, but I'm far less
> > > convinced that having segment and block size in there, as 32bit numbers no
> > > less, is worthwhile. After all, if the system id matches, it's not likely that
> > > the xlog block or segment size differ.
> >
> > Hmm. I don't think we should remove those checks, as I can see people
> > that would want to change their XLog block size with e.g.
> > pg_reset_wal.
>
> I don't think that's something we need to address in every physical
> segment. For one, there's no option to do so.

Not block size, but xlog segment size is modifiable with pg_resetwal,
and could thus reasonably change across restarts. Apart from the more
practical concern that compile-time options require you to swap out
binaries, I don't really see why xlog block size couldn't be changed
with pg_resetwal in a cleanly shut down cluster, as one does with the
WAL segment size.

> But more importantly, if they
> don't change the xlog block size, we'll just accept random WAL as well. If
> somebody goes to the trouble of writing a custom tool, they can live with the
> consequences of that potentially causing breakage. Particularly if the checks
> wouldn't meaningfully prevent that anyway.

I don't understand what you mean by "we'll just accept random WAL as
well". We do significant validation in XLogReaderValidatePageHeader
to make sure that all pages of WAL are sufficiently well-formed that
they can safely be read by the available infrastructure with the
least chance of misreading data. There is currently no chance that we
read WAL from WAL segments that contain otherwise valid data written
with a different segment or block size. That includes WAL from
segments created before a pg_resetwal changed the WAL segment size.

If this "custom tool" refers to the typo-ed name of pg_resetwal, that
is hardly a custom tool: it is shipped with PostgreSQL and you can
find the sources under src/bin/pg_resetwal.
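
To illustrate the kind of cross-check I mean, here is a rough,
self-contained sketch. It is not the code of
XLogReaderValidatePageHeader, which checks more than this; the struct,
enum and function names below are hypothetical and only model the three
long-header fields discussed in this thread (system ID, segment size,
block size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of the extra fields in a long page header. */
typedef struct SketchLongHeader
{
    uint64_t sysid;     /* system identifier of the cluster */
    uint32_t seg_size;  /* WAL segment size in bytes */
    uint32_t blcksz;    /* WAL block size in bytes */
} SketchLongHeader;

/* Hypothetical result codes; the real reader reports error messages instead. */
typedef enum SketchResult
{
    SKETCH_OK,
    SKETCH_BAD_SYSID,
    SKETCH_BAD_SEG_SIZE,
    SKETCH_BAD_BLCKSZ
} SketchResult;

/*
 * Compare the header found at the start of a segment with the values the
 * running cluster expects. Any mismatch means the segment must not be read.
 */
SketchResult
sketch_validate_long_header(const SketchLongHeader *found,
                            const SketchLongHeader *expected)
{
    assert(found != NULL && expected != NULL);

    if (found->sysid != expected->sysid)
        return SKETCH_BAD_SYSID;
    if (found->seg_size != expected->seg_size)
        return SKETCH_BAD_SEG_SIZE;
    if (found->blcksz != expected->blcksz)
        return SKETCH_BAD_BLCKSZ;
    return SKETCH_OK;
}
```

In this model, a 16MB segment left over from before a pg_resetwal that
switched the cluster to 64MB segments is rejected on the seg_size
comparison even though its system ID still matches.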

> > After that we'll only have the system ID left from the extended
> > header, which we could store across 2 pages in the (current) alignment
> > losses of xlp_rem_len - even pages the upper half, uneven pages the
> > lower half of the ID. This should allow for enough integrity checks
> > without further increasing the size of XLogPageHeader in most
> > installations.
>
> I doubt that that's a good idea - what if there's just a single page in a
> segment? And there aren't earlier segments? That's not a rare case, IME.

Then we'd still have 50% of a system ID which we can check against for
any corruption. I agree that it increases the chance of conflicts, but
it's still strictly better than nothing at all.
An alternative solution would be to write the first two pages of a WAL
segment regardless of contents, so that we essentially never have
access to only the first page during crash recovery. Physical
replication's recovery wouldn't be able to read ahead, but I consider
that less problematic.
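
To make the split-ID idea concrete, a minimal sketch follows. It
assumes a 64-bit system ID and a hypothetical 32-bit slot per page
header (the space currently lost to alignment after xlp_rem_len); none
of these function names exist in PostgreSQL, and a real patch would
look different.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Half of the system ID that a given page would carry: even pages store
 * the upper 32 bits, odd pages the lower 32 bits.
 */
uint32_t
sketch_sysid_half(uint64_t sysid, uint64_t pageno)
{
    if (pageno % 2 == 0)
        return (uint32_t) (sysid >> 32);
    return (uint32_t) (sysid & UINT64_C(0xFFFFFFFF));
}

/*
 * Check a single page against the expected system ID. With only one page
 * available this verifies 32 of the 64 bits.
 */
bool
sketch_page_matches(uint64_t expected_sysid, uint64_t pageno,
                    uint32_t stored_half)
{
    return sketch_sysid_half(expected_sysid, pageno) == stored_half;
}

/* Rebuild the full system ID from an even page and the odd page after it. */
uint64_t
sketch_sysid_from_pair(uint32_t even_half, uint32_t odd_half)
{
    return ((uint64_t) even_half << 32) | (uint64_t) odd_half;
}
```

With a single page only sketch_page_matches() is usable, which is the
"50% of a system ID" case; once two consecutive pages are readable,
sketch_sysid_from_pair() gives back the full value for comparison.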

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

In response to

Re: Lowering the default wal_blocksize to 4K at 2023-10-10 23:29:33 from Andres Freund

Responses

Re: Lowering the default wal_blocksize to 4K at 2023-10-11 22:16:33 from Andres Freund
