Re: better page-level checksums

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: better page-level checksums
Date: 2022-06-10 00:00:05
Message-ID: CAEze2Whg1p=fEs6-kyRCpKhs5uu5kQbMLDp0NVbcr8oJqaghcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 9 Jun 2022 at 23:13, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Oct 7, 2021 at 11:50 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > Alternatively, we could use
> > that technique to just provide a better per-page checksum than what we
> > have today. Maybe we could figure out how to leverage that to move to
> > 64bit transaction IDs with some page-level epoch.
>
> I'm interested in assessing the feasibility of a "better page-level
> checksums" feature. I have a few questions, and a few observations.
> One of my questions is what algorithm(s) we'd want to support. I did a
> quick Google search and found that brtfs supports CRC-32C, XXHASH,
> SHA256, and BLAKE2B. I don't know that we want to support that many
> options (but maybe we do) and I don't think CRC-32C makes any sense
> here, for two reasons. First, we've already got a 16-bit checksum, and
> a 32-bit checksum doesn't seem like it's gaining enough to be worth
> the implementation complexity. Second, we're probably going to have to
> dole out per-page space in multiples of MAXALIGN, and that's usually
> 8.

Why so? We already dole out per-page space in 4-byte increments
through pd_linp, and I see no reason why we can't reserve some line
pointers for per-page metadata if we decide that we need extra
per-page ~overhead~ metadata.

> I think for this purpose we should limit ourselves to algorithms
> whose output size is, at minimum, 64 bits, and ideally, a multiple of
> 64 bits. I'm sure there are plenty of options other than the ones that
> btrfs uses; I mentioned them only as a way of jump-starting the
> discussion. Note that SHA-256 and BLAKE2B apparently emit enormously
> wide 16 BYTE checksums. That's a lot of space to consume with a
> checksum, but your chances of a collision are very small indeed.

Isn't the goal of a checksum to find - and where possible, correct -
bit flips and other broken pages? I would suggest not to use
cryptographic hash functions for that, as those are rarely
error-correcting.

> Even if we only offer one new kind of checksum, making space for a
> wider checksum makes the page format variable in a way that it
> currently isn't. There seem to be about 50 compile-time constants in
> the source code whose values are computed based on the block size and
> amount of special space in use by some particular AM (yikes!).

Isn't that expected for most of those places? With the current
bufpage.h description of Page, it seems obvious that all bytes on a
page except those in the "hole" and those in the page header are under
full control of the AM. Of course AMs will pre-calculate limits and
offsets during compilation, that saves recalculation cycles and/or
cache lines with constants to keep in L1.

> For
> example, for the heap, there's stuff like MaxHeapTuplesPerPage and
> MaxHeapTupleSize. If in the future we have some pages that are just
> like the ones we have today, and other clusters where we've allowed
> space for a checksum, then those constants become run-time variable.
> And since they're used in some very low-level functions that are
> called a lot, like PageGetHeapFreeSpace(), that seems potentially
> problematic. The problem is multiplied if you also think about trying
> to store an epoch on each heap page, as per Stephen's proposal above,
> because now every page used by any AM whatsoever might or might not
> have a checksum, and every heap page might also have or not have an
> epoch XID. I think it's going to be somewhat tricky to figure out a
> scheme here that avoids making things slow.

Can't we add some extra fork that stores this extra per-page
information, and contains this extra metadata in a double-buffered
format, so that both before the actual page is written the metadata
too is written to disk, while the old metadata is available too for
recovery purposes. This allows us to maintain the current format with
its low per-page overhead, and only have extra overhead (up to 2x
writes for each page, but the writes for these metadata pages need not
be BLCKSZ in size) for those that opt-in to the more computationally
expensive features of larger checksums, nonces, and/or other non-AM
per-page ~overhead~ metadata.

> Originally I was thinking
> that things like MaxHeapTuplesPerPage ought to become macros or static
> inline functions, but now I have what I think is a better idea: make
> them into global variables and initialize them with the appropriate
> values for the cluster right after we read the control file. This
> doesn't solve the problem if some pages are different than others,
> though, and even for the case where every page in the cluster has the
> same amount of reserved space, reading a global variable is not going
> to be as efficient as referring to a constant compiled right into the
> code. I'm hopeful that all of this is solvable somehow, but it's
> hairy, for sure.
>
> Another thing I realized is that we would probably end up with the
> pd_checksum unused when this other feature is activated. If someone
> comes up with a clever idea for how to allocate extra space without
> needing things to be a multiple of MAXIMUM_ALIGNOF, they could
> potentially shrink the space they need elsewhere by 2 bytes and then
> use both that space and pd_checksum, but otherwise pd_checksum is
> likely to be dead when an enhanced checksum feature is in use. Since
> it's also dead when checksums are turned off, that's probably OK. I
> suppose another possibility is to allow both to be turned on and off
> independently, i.e. let someone have both a Fletcher-16 checksum in
> pd_checksum, and also a wider checksum in this other chunk of space,
> but I'm not sure whether that's really a useful thing to be able to
> do. (Opinions?)

I'd prefer if we didn't change the way pages are presented to AMs.
Currently, it is clear what area is available to you if you write an
AM that uses the bufpage APIs. Changing the page format to have the
buffer manager also touch / reserve space in the special areas seems
like a break of abstraction: Quoting from bufpage.h:

* AM-generic per-page information is kept in PageHeaderData.
*
* AM-specific per-page data (if any) is kept in the area marked "special
* space"; each AM has an "opaque" structure defined somewhere that is
* stored as the page trailer. an access method should always
* initialize its pages with PageInit and then set its own opaque
* fields.

I'd rather we keep this contract: am-generic stuff belongs in
PageHeaderData, with the rest of the page fully available for the AM
to use (including the special area).

Kind regards,

Matthias van de Meent

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2022-06-10 00:18:06 Re: Collation version tracking for macOS
Previous Message Peter Geoghegan 2022-06-09 23:59:47 Re: Collation version tracking for macOS