Re: POC: make mxidoff 64 bits

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Maxim Orlov <orlovmg(at)gmail(dot)com>
Cc: wenhui qiu <qiuwenhuifx(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: make mxidoff 64 bits
Date: 2024-11-15 11:06:00
Message-ID: ca451a6a-5b0f-433c-8ab9-b28b7f480bd7@iki.fi
Lists: pgsql-hackers

On 13/11/2024 17:44, Maxim Orlov wrote:
> On Tue, 12 Nov 2024 at 02:31, Heikki Linnakangas <hlinnaka(at)iki(dot)fi
> <mailto:hlinnaka(at)iki(dot)fi>> wrote:
> On a different note, I'm surprised you're rewriting member segments
> from
> scratch, parsing all the individual member groups and writing them out
> again. There's no change to the members file format, except for the
> numbering of the files, so you could just copy the files under the new
> names without paying attention to the contents. It's not wrong to parse
> them in detail, but I'd assume that it would be simpler not to.
>
> Yes, at the beginning I also thought that it would be possible to get by
> with simple copying. But in case of wraparound, we must "bypass" the
> invalid zero offset value. See, the old 32-bit offsets wrapped at 2^32,
> so 0 values appear in multixact.c and must be handled. Bypassed, in fact.
> When we switch to 64-bit offsets, we have two options:
> 1). Bypass every ((uint32) offset == 0) value in multixact.c;
> 2). Convert members and bypass the invalid value once.
>
> The first option seems too weird to me. So, we have to repack members
> and bypass the invalid value.

Hmm, so if I understand correctly, this is related to how we determine
the length of the members array, by looking at the next multixid's
offset. This is explained in GetMultiXactIdMembers:

> /*
> * Find out the offset at which we need to start reading MultiXactMembers
> * and the number of members in the multixact. We determine the latter as
> * the difference between this multixact's starting offset and the next
> * one's. However, there are some corner cases to worry about:
> *
> * 1. This multixact may be the latest one created, in which case there is
> * no next one to look at. In this case the nextOffset value we just
> * saved is the correct endpoint.
> *
> * 2. The next multixact may still be in process of being filled in: that
> * is, another process may have done GetNewMultiXactId but not yet written
> * the offset entry for that ID. In that scenario, it is guaranteed that
> * the offset entry for that multixact exists (because GetNewMultiXactId
> * won't release MultiXactGenLock until it does) but contains zero
> * (because we are careful to pre-zero offset pages). Because
> * GetNewMultiXactId will never return zero as the starting offset for a
> * multixact, when we read zero as the next multixact's offset, we know we
> * have this case. We handle this by sleeping on the condition variable
> * we have just for this; the process in charge will signal the CV as soon
> * as it has finished writing the multixact offset.
> *
> * 3. Because GetNewMultiXactId increments offset zero to offset one to
> * handle case #2, there is an ambiguity near the point of offset
> * wraparound. If we see next multixact's offset is one, is that our
> * multixact's actual endpoint, or did it end at zero with a subsequent
> * increment? We handle this using the knowledge that if the zero'th
> * member slot wasn't filled, it'll contain zero, and zero isn't a valid
> * transaction ID so it can't be a multixact member. Therefore, if we
> * read a zero from the members array, just ignore it.
> *
> * This is all pretty messy, but the mess occurs only in infrequent corner
> * cases, so it seems better than holding the MultiXactGenLock for a long
> * time on every multixact creation.
> */
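To be concrete, the core of that computation boils down to something like this (just a toy model to illustrate, not the real multixact.c code; the names are invented):

```c
#include <stdint.h>

typedef uint64_t MultiXactOffset;   /* hypothetical: offsets widened to 64 bits */

/*
 * Toy model: number of members = next multixact's offset minus ours.
 * A zero next offset means the next entry hasn't been written yet
 * (corner case #2); the real code sleeps on a condition variable.
 * If 64-bit offsets never wrap, corner case #3 (next offset == 1 near
 * the wraparound point) cannot arise at all.
 */
static int64_t
multixact_member_count(MultiXactOffset our_offset,
                       MultiXactOffset next_offset)
{
    if (next_offset == 0)
        return -1;              /* not yet known; caller must wait */
    return (int64_t) (next_offset - our_offset);
}
```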

With 64-bit offsets, can we assume that it never wraps around? We often
treat 2^64 as "large enough that we'll never run out", e.g. LSNs are
also assumed to never wrap around. I think that would be a safe
assumption here too.
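For scale, a back-of-the-envelope check (sketch; the function name is made up):

```c
#include <stdint.h>

/* Years until 2^64 member offsets are consumed at a given allocation rate. */
static double
offset_exhaustion_years(double offsets_per_second)
{
    return (double) UINT64_MAX / offsets_per_second /
           (60.0 * 60.0 * 24.0 * 365.0);
}
```

Even at an aggressive million member offsets per second, that comes out to roughly 585,000 years, so treating the 64-bit space as inexhaustible seems safe.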

If we accept that, we don't need to worry about case 3 anymore. But if
we upgrade wrapped-around members files by just renaming them, there
could still be a members array where we had skipped offset 0, and
reading that after the upgrade might get confused. We could continue to
ignore a 0 XID in the members array like the comment says; I think that
would be enough. But yeah, maybe it's better to bite the bullet in
pg_upgrade and squeeze those out.

Does your upgrade test suite include case 3, where the next multixact's
offset is 1?

Can we remove MaybeExtendOffsetSlru() now? There are a bunch of other
comments and checks that talk about binary-upgraded values too that we
can hopefully clean up now.

If we are to parse the member segments in detail in upgrade anyway, I'd
be tempted to make some further changes / optimizations:

- You could leave out all lock-only XID members during upgrade, because
they're no longer relevant after upgrade (all the XIDs will have
committed or aborted and released their locks; we also require prepared
transactions to be completed before upgrading). It'd be enough to
include actual UPDATE/DELETE XIDs.
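In upgrade-code terms, that filter could look roughly like this (a sketch; the enum is modeled on MultiXactStatus from multixact.h, the helper name is invented):

```c
#include <stdbool.h>

/* Modeled on MultiXactStatus in multixact.h; sketch only. */
typedef enum
{
    MultiXactStatusForKeyShare,
    MultiXactStatusForShare,
    MultiXactStatusForNoKeyUpdate,
    MultiXactStatusForUpdate,
    MultiXactStatusNoKeyUpdate,     /* an actual (no-key) update */
    MultiXactStatusUpdate           /* an actual update or delete */
} MultiXactStatus;

/*
 * Keep only members that recorded a real UPDATE/DELETE; lock-only
 * members are irrelevant once no pre-upgrade transaction can survive.
 */
static bool
keep_member_on_upgrade(MultiXactStatus status)
{
    return status == MultiXactStatusNoKeyUpdate ||
           status == MultiXactStatusUpdate;
}
```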

- The way we determine the length of the members array by looking at the
next multixid's offset is a bit complicated. We could have one extra
flag per XID in the members to indicate "this is the last member of this
multixid". That could either replace the current mechanism of looking
at the next offset, or serve as an additional cross-check.

- Do we still like the "group" representation, with 4 bytes of flags
followed by 4 XIDs? I wonder if it'd be better to just store 5 bytes per
XID unaligned.
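To make the comparison concrete, the unaligned alternative could be as simple as this (illustrative sketch, not a proposal for the actual on-disk format; function names invented):

```c
#include <stdint.h>

/*
 * Sketch: one member stored as 5 unaligned bytes, 1 flag byte followed
 * by a 4-byte XID (little-endian), instead of the current groups of
 * 4 flag bytes followed by 4 XIDs.
 */
static void
member_encode(uint8_t *dst, uint8_t flags, uint32_t xid)
{
    dst[0] = flags;
    dst[1] = (uint8_t) (xid & 0xff);
    dst[2] = (uint8_t) ((xid >> 8) & 0xff);
    dst[3] = (uint8_t) ((xid >> 16) & 0xff);
    dst[4] = (uint8_t) ((xid >> 24) & 0xff);
}

static uint32_t
member_decode_xid(const uint8_t *src)
{
    return (uint32_t) src[1] |
           ((uint32_t) src[2] << 8) |
           ((uint32_t) src[3] << 16) |
           ((uint32_t) src[4] << 24);
}
```

A spare flag bit in the leading byte could double as the "last member of this multixid" marker mentioned above.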

- A more radical idea: There can be only one updating XID in one
multixid. We could store that directly in the offsets SLRU, and keep
only the locking XIDs in members. That way, the members SLRU would
become less critical; it could be safely reset on crash for example
(except for prepared transactions, which could still be holding locks,
but it'd still be less serious). Separating correctness-critical data
from more ephemeral state is generally a good idea.

I'm not insisting on any of these changes, just some things that might
be worth considering if we're rewriting the SLRUs on upgrade anyway.

--
Heikki Linnakangas
Neon (https://neon.tech)
