
Re: better page-level checksums

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: better page-level checksums
Date:	2022-06-14 12:55:56
Message-ID:	CA+TgmoaYfUuWpMG0ynC-z2ofTpFdA6a6QCyrzV3iCmkVjwoYpA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jun 13, 2022 at 5:14 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> It's not that I disagree with (or dislike the idea of) increasing the
> resilience of checksums, I just want to be very careful that we don't
> trade (potentially significant) runtime performance for features
> people might not use. This thread seems very related to the 'storing
> an explicit nonce'-thread, which also wants to reclaim space from a
> page that is currently used by AMs, while AMs would lose access to
> certain information on pages and certain optimizations that they could
> do before. I'm very hesitant to let just any modification to the page
> format go through because someone needs extra metadata attached to a
> page.

Right. So, to be clear, I think there is an opportunity to store ONE
extra blob of data in the page. It might be an extended checksum, or
it might be a nonce for cryptographic authentication, but it can't be
both. I think this is OK, because in earlier discussions of TDE, it
seems that if you're using encryption and also want to verify page
integrity, you'll use an encryption system that produces some kind of
verifier, and you'll store that into this space in the page instead of
using an enhanced-checksum feature.

In other words, I'm imagining creating a space at the end of each page
for some sort of enhanced security or data integrity feature, and you
can either choose not to use one (in which case things work as they do
today), or you can choose an extended checksums feature, or maybe in
the future you can choose some form of TDE that involves storing a
nonce or a page verifier in the page. But you just get one.
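
As a rough illustration of the "you just get one" rule, the choice could be modeled as a single cluster-wide setting that determines how many bytes are reserved at the end of every page. This is only a sketch: the enum, the function names, and the byte counts (32 and 16) are hypothetical and not taken from any patch.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

/* Hypothetical: exactly one cluster-wide choice for the per-page blob. */
typedef enum PageFeature
{
	PAGE_FEATURE_NONE,			/* today's format, nothing reserved */
	PAGE_FEATURE_EXT_CHECKSUM,	/* extended checksum */
	PAGE_FEATURE_TDE_VERIFIER	/* nonce or verifier for some future TDE */
} PageFeature;

/* Bytes reserved at the end of every page; the sizes are made up. */
static size_t
page_feature_reserved_bytes(PageFeature f)
{
	switch (f)
	{
		case PAGE_FEATURE_NONE:
			return 0;
		case PAGE_FEATURE_EXT_CHECKSUM:
			return 32;
		case PAGE_FEATURE_TDE_VERIFIER:
			return 16;
	}
	assert(!"unknown page feature");
	return 0;
}

/* Space left for the page header, line pointers, tuples, and special area. */
static size_t
page_usable_bytes(PageFeature f)
{
	return BLCKSZ - page_feature_reserved_bytes(f);
}
```

Because the setting is a single enum value rather than a set of flags, two features can never both claim space on the same page, and a new integrity or encryption scheme would simply become one more enum member.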

Now, the logical question to ask is: well, if there's only one
opportunity to store an extra blob of data on every page, is this the
best way to use it? What if someone comes along with another feature
that also wants to store a blob of data on every page, and they can't
do it because this proposal got there first? My answer is: well, if
that additional feature is something that provides encryption or
tamper-resistance or data integrity or security in any form, then it
can just be added as a new option for how you use this blob of space,
and users who prefer the new thing to the existing options can pick
it. If it's something else, then ... what is it, exactly? It seems to
me that the kinds of things that require space in *every* page of the
cluster are really the things that fall into this category.

For example, Stephen mused earlier that maybe while we're at it we
could find a way to include an XID epoch in every page. Maybe so, but
we wouldn't actually want that in *every* page. We would only want it
in the heap pages. And as far as I can see that's pretty generally how
things go. There are plenty of projects that might want extra space in
each page *for a certain AM* and I don't see any reason why what I
propose to do here would rule that out. I think this and that could
both be done, and doing this might even make doing that easier by
putting in place some useful infrastructure. What I don't think we can
get away with is having multiple systems that are each taking a bite
out of every page for every AM -- but I think that's OK, because I
don't think there's a lot of need for multiple such systems.

> That reminds me, there's one more item to be put on the compatibility
> checklist: Currently, the FSM code assumes it can use all space on a
> page (except the page header) for its total of 3 levels of FSM data.
> Mixing page formats would break how it currently works, as changing
> the space that is available on a page will change the fanout level of
> each leaf in the tree, which our current code can't handle. To change
> the page format of one page in the FSM would thus either require a
> rewrite of the whole FSM fork, or extra metadata attached to the
> relation that details where the format changes. A similar issue exists
> with the VM fork.

I agree with all of this, except that I think "mixing page formats" is
simply not something we can do.
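
To see why the fanout is tied to the usable space, here is a simplified model of how many leaf slots fit on one FSM page. It assumes an 8 kB block, a 24-byte aligned page header, 4 bytes of per-page FSM bookkeeping ahead of the node array, and one byte per tree node; the function name and the `reserved` parameter are hypothetical.

```c
#include <assert.h>

#define BLCKSZ 8192					/* default block size */
#define PAGE_HEADER_BYTES 24		/* assumed aligned page header size */
#define FSM_PAGE_FIXED_BYTES 4		/* assumed bookkeeping before the node array */

/*
 * Leaf slots on one FSM page when `reserved` bytes at the end of the page
 * are set aside for something else.  Each node of the in-page binary tree
 * takes one byte; the non-leaf part is modeled as a fixed BLCKSZ/2 - 1
 * nodes, and whatever is left over holds leaves.
 */
static int
fsm_leaf_nodes_per_page(int reserved)
{
	int			nodes;
	int			nonleaf = BLCKSZ / 2 - 1;

	assert(reserved >= 0);
	nodes = BLCKSZ - PAGE_HEADER_BYTES - FSM_PAGE_FIXED_BYTES - reserved;
	return nodes - nonleaf;
}
```

Under these assumptions an unmodified page holds 4069 leaf slots, while reserving 32 bytes drops that to 4037. Every page at every level of the FSM would have to agree on the same number for block-to-slot addressing to work, which is why a fork with mixed formats is a problem.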

> That being said, I think that it could be possible to reuse
> pd_checksum as an extra area indicator between pd_upper and
> pd_special, so that we'd get [pageheader][pd_linp...] pd_lower [hole]
> pd_upper [datas] pd_storage_ext [blackbox] pd_special [special area].
> This should require limited rework in current AMs, especially if we
> provide a global MAX_STORAGE_EXT_SIZE that AMs can use to get some
> upper limit on how much overhead the storage uses per page.

This is an interesting alternative. It's unclear to me that it makes
anything better if the [blackbox] area is before the special area vs.
afterward. And either way, if that area is fixed-size across the
cluster, you don't really need to use pd_checksum to find it, because
you can just know where it is. A possible advantage of this approach
is that it might make it simpler to cope with a scenario where some
pages in the cluster have this blackbox space and others don't. I
wasn't really thinking that on-line page format conversions were
likely to be practical, but certainly the chances are better if we've
got an explicit pointer to the extra space vs. just knowing where it
has to be.
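
The difference between "just knowing where it is" and having an explicit pointer can be sketched as follows. The struct is a cut-down stand-in for the real page header, `pd_storage_ext` is the field name from the quoted proposal, and treating a zero value as "no such area on this page" is an assumption made only for this example.

```c
#include <stdint.h>

#define BLCKSZ 8192				/* default block size */

/*
 * Implicit placement: the area has one size for the whole cluster and sits
 * at the very end of the page, so its offset is computed, never stored.
 */
static uint16_t
blackbox_offset_implicit(uint16_t cluster_blackbox_size)
{
	return (uint16_t) (BLCKSZ - cluster_blackbox_size);
}

/* Cut-down stand-in for the page header, for illustration only. */
typedef struct SketchPageHeader
{
	uint16_t	pd_upper;		/* start of tuple data */
	uint16_t	pd_storage_ext; /* start of the area; 0 = none (assumed) */
	uint16_t	pd_special;		/* start of the AM's special space */
} SketchPageHeader;

/*
 * Explicit placement: each page says for itself whether it has the area,
 * which lies between pd_storage_ext and pd_special.
 */
static int
page_has_blackbox(const SketchPageHeader *h)
{
	return h->pd_storage_ext != 0 && h->pd_storage_ext < h->pd_special;
}

static uint16_t
blackbox_size_explicit(const SketchPageHeader *h)
{
	if (!page_has_blackbox(h))
		return 0;
	return (uint16_t) (h->pd_special - h->pd_storage_ext);
}
```

With the implicit form every page must already be in the new format before the feature is turned on. With the explicit form, old and new pages could in principle coexist, since a reader can tell them apart page by page.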

> Alternatively, we could claim some space on a page using a special
> line pointer at the start of the page referring to storage data, while
> having the same limitation on size.

That sounds messy.

> One last option is we recognise that there are two storage locations
> of pages that have different data requirements -- on-disk that
> requires checksums, and in-memory that requires LSNs. Currently, those
> fields are both stored on the page in distinct fields, but we could
> (_could_) update the code to drop LSN when we store the page, and drop
> the checksum when we load the page (at the cost of redo speed when
> recovering from an unclean shutdown). That would provide an extra 64
> bits on the page without breaking storage, assuming AMs don't already
> misuse pd_lsn.

It seems wrong to me to say that we don't need the LSN for a page
stored on disk. Recovery relies on it.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Re: better page-level checksums at 2022-06-13 21:14:37 from Matthias van de Meent

Responses

Re: better page-level checksums at 2022-06-14 15:08:43 from Matthias van de Meent
