Re: [PATCHES] Post-special page storage TDE support

From: David Christensen <david(dot)christensen(at)crunchydata(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: [PATCHES] Post-special page storage TDE support
Date: 2023-11-13 20:03:36
Message-ID: CAOxo6XJSXQJmg0guCn=tN=6PdDgYB_jV6dA1esEKPX97_v0Drg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 7, 2023 at 6:20 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen <david(at)pgguru(dot)net>
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which
> will be ultimately used for
> > encrypted data, extended checksums, and potentially other things. This
> data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime
> based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing
> clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with
> > different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't
> be
> supported across this commit, rather than just between different
> settings...
>

Thanks for the review.

> I think as a whole this is not an insane idea. A few comments:
>
> - IMO the patch touches many places it shouldn't need to touch, because of
> essentially renaming a lot of existing macro names to *Limit,
> necessitating modifying a lot of users. I think instead the few places
> that
> care about the runtime limit should be modified.
>
> As-is the patch would cause a lot of fallout in extensions that just do
> things like defining an on-stack array of Datums or such - even though
> all
> they'd need is to change the define to the *Limit one.
>
> Even leaving extensions aside, it must makes reviewing (and I'm sure
> maintaining) the patch very tedious.
>

You make a good point, and I think you're right that we could teach the
places that care about runtime vs compile time differences about the
changes while leaving other callers alone. The *Limit ones were introduced
since we need constant values here from the Calc...() macros, but could try
keeping the existing *Limit with the old name and switching things around.
I suspect there will be the same amount of code churn, but less mechanical.

> - I'm a bit worried about how the extra special page will be managed - if
> there are multiple features that want to use it, who gets to put their
> data
> at what offset?
>
> After writing this I saw that 0002 tries to address this - but I don't
> like
> the design. It introduces runtime overhead that seems likely to be
> visible.
>

Agreed this could be optimized.

> - Checking for features using PageGetFeatureOffset() seems the wrong
> design to
> me - instead of a branch for some feature being disabled, perfectly
> predictable for the CPU, we need to do an external function call every
> time
> to figure out that yet, checksums are *still* disabled.
>

This is probably not a supported approach (it felt a little icky), but I'd
played around with const pointers to structs of const elements, where the
initial values of a global var was populated early on (so set once and
never changed post init), and the compiler didn't complain and things
seemed to work ok; not sure if this approach might help balance the early
mutability and constant lookup needs:

typedef struct PageFeatureOffsets {
const Size feature0offset;
const Size feature1offset;
...
} PageFeatureOffsets;

PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

void InitOffsets() {
*((Size*)&offsets.feature0offset) = ...;
*((Size*)&offsets.feature1offset) = ...;
...
}

- Recomputing offsets every time in PageGetFeatureOffset() seems too
> expensive. The offsets can't change while running as
> PageGetFeatureOffset()
> have enough information to distinguish between different kinds of
> relations
>

Yes, this was a simple approach for ease of implementation; there is
certainly a way to precompute a lookup table from the page feature bitmask
into the offsets themselves or otherwise precompute, turn from function
call into inline/macro, etc.

> - so why do we need to recompute offsets on every single page? I'd
> instead
> add a distinct offset variable for each feature.
>

This would work iff there is a single page feature set across all pages in
the cluster; I'm not sure we don't want more flexibility here.

> - Modifying every single PageInit() call doesn't make sense to me. That'll
> just create a lot of breakage for - as far as I can tell - no win.
>

This was a placeholder to allow different features depending on page type;
to keep things simple for now I just used the same values here, but we
could move this inside PageInit() instead (again, assuming single feature
set per cluster).

> - Why is it worth sacrificing space on every page to indicate which
> features
> were enabled? I think there'd need to be some convincing reasons for
> introducing such overhead.
>

The point here is if we can use either GCM authtag or stronger checksums
then we've gained the ability to authenticate the page contents at the cost
of reassigning those bits, in a way that would support variable
permutations of features for different relations or page types, if so
desired. A single global setting here both eliminates that possibility as
well as requires external data in order to fully interpret pages.

> - Is it really useful to encode the set of features enabled in a cluster
> with
> a bitmask? That pretty much precludes utilizing extra page space in
> extensions. We could instead just have an extra cluster-wide file that
> defines a mapping of offset to feature.
>

Given the current design, yes we do need that, which does make it harder to
allocate/use from an extension. Due to needing to have consistent offsets
for a given feature set (however represented on a page), the implementation
load going forward as-is involves ensuring that a given bit always maps to
the same offset in the page regardless of additional features available in
the future. So the 0'th bit if enabled would always map to the 8 byte chunk
at the end of the page, the 1st bit corresponds to some amount of space
prior to that, etc. I'm not sure how to get that property without some
sort of bitmap or otherwise indexed operation.

I get what you're saying as far as the more global approach, and while that
does lend itself to some nice properties in terms of extensibility, some of
the features (GCM tags in particular) need to be able to control the page
offset at a consistent location so we can decode the rest of the page
without knowing anything else.

Additionally, since the reserved space/page features are configured at
initdb time I am unclear how a given extension would even be able to stake
a claim here. ...though if we consider this a two-part problem, one of
space reservation and one of space usage, that part could be handled via
allocating more than the minimum in the reserved_page_space and allowing
unallocated page space to be claimed later via some sort of additional
functions/other hook. That opens up other questions though, tracking
whether said space has ever been initialized and what to do when first
accessing existing/new pages as one example.

Best,

David

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2023-11-13 20:04:05 Re: should check collations when creating partitioned index
Previous Message a.rybakina 2023-11-13 19:48:23 Re: POC, WIP: OR-clause support for indexes