From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: better page-level checksums
Date: 2022-06-13 21:14:37
Message-ID: CAEze2Wi5wYinU7nYxyKe_C0DRc6uWYa8ivn5=zg63nKtHBnn8A@mail.gmail.com
Lists: pgsql-hackers
On Fri, 10 Jun 2022 at 15:58, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Jun 9, 2022 at 8:00 PM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > Why so? We already dole out per-page space in 4-byte increments
> > through pd_linp, and I see no reason why we can't reserve some line
> > pointers for per-page metadata if we decide that we need extra
> > per-page ~overhead~ metadata.
>
> Hmm, that's an interesting approach. I was thinking that putting data
> after the PageHeaderData struct would be a non-starter because the
> code that looks up a line pointer by index is currently just
> multiply-and-add and complicating it seems bad for performance.
> However, if we treated the space there as overlapping the line pointer
> array and making some line pointers unusable rather than something
> inserted prior to the line pointer array, we could avoid that. I still
> think it would be kind of complicated, though, because we'd have to
> find every bit of code that loops over the line pointer array or
> accesses it by index and make sure that it doesn't try to access the
> low-numbered line pointers.
>
> > Isn't the goal of a checksum to find - and where possible, correct -
> > bit flips and other broken pages? I would suggest not to use
> > cryptographic hash functions for that, as those are rarely
> > error-correcting.
>
> I wasn't thinking of trying to do error correction, just error
> detection. See also my earlier reply to Peter Geoghegan.
The use of a CRC in our current page format implies that we could correct
(some) bit errors, which is why I presumed that was a goal of page
checksums. I stand corrected.
> > Isn't that expected for most of those places? With the current
> > bufpage.h description of Page, it seems obvious that all bytes on a
> > page except those in the "hole" and those in the page header are under
> > full control of the AM. Of course AMs will pre-calculate limits and
> > offsets during compilation, that saves recalculation cycles and/or
> > cache lines with constants to keep in L1.
>
> Yep.
>
> > Can't we add some extra fork that stores this extra per-page
> > information, and contains this extra metadata in a double-buffered
> > format, so that both before the actual page is written the metadata
> > too is written to disk, while the old metadata is available too for
> > recovery purposes. This allows us to maintain the current format with
> > its low per-page overhead, and only have extra overhead (up to 2x
> > writes for each page, but the writes for these metadata pages need not
> > be BLCKSZ in size) for those that opt-in to the more computationally
> > expensive features of larger checksums, nonces, and/or other non-AM
> > per-page ~overhead~ metadata.
>
> It's not impossible, I'm sure, but it doesn't seem very appealing to
> me. Those extra reads and writes could be expensive, and there's no
> place to cleanly integrate them into the code structure. A function
> like PageIsVerified() -- which is where we currently validate
> checksums -- only gets the page. It can't go off and read some other
> page from disk to perform the checksum calculation.
The buffer I/O code could be made responsible for providing
PageIsVerifiedExtended with a pointer to the block's metadata buffer.
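Something like the following, hand-wavy sketch: PageIsVerifiedExtended and
PIV_LOG_WARNING exist today, but BlockMetaData, wide_checksum_page() and the
extra argument are inventions of this email, standing in for whatever the
metadata fork would provide.

#include "postgres.h"
#include "storage/bufpage.h"

/* hypothetical per-block record kept in the metadata fork */
typedef struct BlockMetaData
{
    uint64      wide_checksum;      /* e.g. 64-bit checksum or nonce */
} BlockMetaData;

/* hypothetical: any wider checksum over the page contents */
extern uint64 wide_checksum_page(Page page, BlockNumber blkno);

static bool
VerifyPageOnRead(Page page, BlockNumber blkno, BlockMetaData *meta)
{
    /* No metadata fork: fall back to today's in-page checksum check. */
    if (meta == NULL)
        return PageIsVerifiedExtended(page, blkno, PIV_LOG_WARNING);

    /* Metadata fork present: verify the wider checksum stored off-page. */
    return wide_checksum_page(page, blkno) == meta->wide_checksum;
}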
> I'm not exactly sure what you have in mind when you say that the
> writes need not be BLCKSZ in size.
What I meant was that when the extra metadata is stored separately
from the block itself, it could be written directly at its own file
offset instead of tracking BLCKSZ bytes per block, so the metadata
write would be << BLCKSZ in length, while the block itself would
still be a normal BLCKSZ write.
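For example (purely illustrative; the record layout and fork handling are
assumptions), the metadata write for block N could land at a small
fixed-size slot:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-block metadata record: 16 bytes, not BLCKSZ. */
typedef struct BlockMetaRecord
{
    uint64_t    checksum;       /* wider checksum, nonce, ... */
    uint64_t    reserved;
} BlockMetaRecord;

static ssize_t
write_block_meta(int meta_fd, uint32_t blkno, const BlockMetaRecord *rec)
{
    off_t       offset = (off_t) blkno * sizeof(BlockMetaRecord);

    /* The metadata write is << BLCKSZ; the data page write stays BLCKSZ. */
    return pwrite(meta_fd, rec, sizeof(BlockMetaRecord), offset);
}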
> Technically I guess that's true,
> but then the checksums have to be crash safe, or they're not much
> good. If they're not part of the page, how do they get updated in a
> way that makes them crash safe? I guess it could be done: every time
> we write a FPW, enlarge the page image by the number of bytes that are
> stored in this location. When replaying an FPW, update those bytes
> too. And every time we read or write a page, also read or write those
> bytes. In essence, we'd be deciding that pages are 8192+n bytes, but
> the last n bytes are stored in a different file - and, in memory, a
> different buffer pool. I think that would be hugely invasive and
> unpleasant to make work and I think the performance would be poor,
> too.
I agree that this wouldn't be as performant from a read/write
perspective as keeping that metadata inside the block. But on the
other hand, that cost applies only to block read/write operations,
not to in-memory block manipulations.
> > I'd prefer if we didn't change the way pages are presented to AMs.
> > Currently, it is clear what area is available to you if you write an
> > AM that uses the bufpage APIs. Changing the page format to have the
> > buffer manager also touch / reserve space in the special areas seems
> > like a break of abstraction: Quoting from bufpage.h:
> >
> > * AM-generic per-page information is kept in PageHeaderData.
> > *
> > * AM-specific per-page data (if any) is kept in the area marked "special
> > * space"; each AM has an "opaque" structure defined somewhere that is
> > * stored as the page trailer. an access method should always
> > * initialize its pages with PageInit and then set its own opaque
> > * fields.
> >
> > I'd rather we keep this contract: am-generic stuff belongs in
> > PageHeaderData, with the rest of the page fully available for the AM
> > to use (including the special area).
>
> I don't think that changing the contract has to mean that it becomes
> unclear what the contract is. And you can't improve any system without
> changing some stuff. But you certainly don't have to like my ideas or
> anything....
It's not that I disagree with (or dislike the idea of) increasing the
resilience of checksums; I just want to be very careful that we don't
trade (potentially significant) runtime performance for features
people might not use. This thread seems closely related to the 'storing
an explicit nonce' thread, which also wants to reclaim page space that
is currently available to AMs, so AMs would lose access to certain
information on pages and certain optimizations they could do before.
I'm very hesitant to let just any modification to the page format go
through because someone needs extra metadata attached to a page.
That reminds me, there's one more item for the compatibility
checklist: currently, the FSM code assumes it can use all space on a
page (except the page header) for its total of 3 levels of FSM data.
Mixing page formats would break how it currently works, as changing
the space available on a page changes the fanout of each node (page)
in the tree, which our current code can't handle (rough sketch below).
Changing the page format of one page in the FSM would thus require
either a rewrite of the whole FSM fork, or extra metadata attached to
the relation that details where the format changes. A similar issue
exists for the VM fork.
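To illustrate the FSM dependency with rough numbers (not the actual
freespace.c math), the per-page fanout is derived directly from the usable
space, so shrinking that space on only some pages changes the fanout
mid-fork:

/*
 * Rough sketch: each FSM page packs a binary tree of 1-byte nodes into
 * whatever space is left after the page header, so the number of leaf
 * slots (the fanout) follows directly from the usable space.  The
 * constants are approximations, not the real freespace.c values.
 */
#define BLCKSZ              8192
#define PAGE_HEADER_SIZE    24      /* roughly sizeof(PageHeaderData) */

static int
fsm_leaf_slots(int reserved_bytes)
{
    int         usable = BLCKSZ - PAGE_HEADER_SIZE - reserved_bytes;

    /* about half the nodes of the binary tree are leaves */
    return usable / 2;
}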
That being said, I think it could be possible to reuse pd_checksum as
an indicator of an extra area between pd_upper and pd_special, so that
we'd get [pageheader][pd_linp...] pd_lower [hole] pd_upper [tuple data]
pd_storage_ext [blackbox] pd_special [special area]. This should
require limited rework in current AMs, especially if we provide a
global MAX_STORAGE_EXT_SIZE that AMs can use as an upper limit on how
much overhead the storage layer uses per page.
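A minimal sketch of the accessors that could come with that, assuming
pd_checksum is reinterpreted as the start offset of the blackbox area (the
names pd_storage_ext and MAX_STORAGE_EXT_SIZE are inventions of this email,
not existing code):

#include "postgres.h"
#include "storage/bufpage.h"

/* Assumed global cap on how much per-page space storage may claim. */
#define MAX_STORAGE_EXT_SIZE    32

/* pd_checksum reinterpreted as the start offset of the blackbox area. */
static inline char *
PageGetStorageExt(Page page)
{
    PageHeader  phdr = (PageHeader) page;

    return (char *) page + phdr->pd_checksum;
}

static inline uint16
PageGetStorageExtSize(Page page)
{
    PageHeader  phdr = (PageHeader) page;

    /* the blackbox sits between pd_storage_ext and pd_special */
    return phdr->pd_special - phdr->pd_checksum;
}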
Alternatively, we could claim some space on a page using a special
line pointer at the start of the page that refers to the storage data,
with the same limitation on size.
One last option is to recognise that there are two storage locations
for pages, with different data requirements: on disk, which requires
checksums, and in memory, which requires LSNs. Currently, both values
are stored on the page in distinct fields, but we could (_could_)
update the code to drop the LSN when we write the page out, and drop
the checksum when we load the page (at the cost of redo speed when
recovering from an unclean shutdown). That would provide an extra 64
bits on the page without breaking storage, assuming AMs don't already
misuse pd_lsn.
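A sketch of that last idea, with compute_wide_checksum() as a purely
hypothetical placeholder: on the way to disk the first 8 bytes stop being
an LSN and become a checksum, and vice versa on the way back in.

#include "postgres.h"
#include "storage/bufpage.h"

/* hypothetical: any 64-bit checksum over the page contents */
extern uint64 compute_wide_checksum(Page page, BlockNumber blkno);

/*
 * The first 8 bytes of the page: an LSN while the page lives in shared
 * buffers, a wide checksum while it lives on disk.
 */
typedef union PageFirstWord
{
    PageXLogRecPtr  lsn;
    uint64          wide_checksum;
} PageFirstWord;

static void
prepare_page_for_write(Page page, BlockNumber blkno)
{
    PageFirstWord *w = (PageFirstWord *) page;

    /* WAL up to the old LSN must already be flushed before this point. */
    w->wide_checksum = compute_wide_checksum(page, blkno);
}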
- Matthias