Re: CSN snapshots in hot standby

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <johncnaylorls(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>
Subject: Re: CSN snapshots in hot standby
Date: 2025-03-31 21:31:43
Message-ID: 80f254d3-8ee9-4cde-a7e3-ee99998154da@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here's a new patchset version. Not much has changed in the actual CSN
patches. But I spent a lot of time refactoring the snapshot management
code, so that there is a simple place to add the "inprogress XID cache"
for the CSN snapshots, in a way that avoids duplicating the cache if a
snapshot is copied around.

Patches 0001-0002 are the patches I posted on a separate thread earlier.
See
https://www.postgresql.org/message-id/ec10d398-c9b3-4542-8095-5fc6408b17d1%40iki.fi.

Patches 0003-0006 contain more snapshot manager changes. The end state
is that an MVCC snapshot consists of two structs: a shared "inner"
struct that contains xmin, xmax and the XID lists, and an "outer" struct
that contains a pointer to the shared struct and the current command ID.
As a snapshot is copied around, all the copies share the same shared,
reference-counted struct.

The rest of the patches are the same CSN patches I posted before,
rebased over the snapshot manager changes.

There's one thing that hasn't been discussed yet: The
ProcArrayRecoveryEndTransaction() function, which replaces
ExpireTreeKnownAssignedTransactionIds() and is called on replay of every
commit/abort record, does this:

> /*
> * If this was the oldest XID that was still running, advance it. This is
> * important for advancing the global xmin, which avoids unnecessary
> * recovery conflicts
> *
> * No locking required because this runs in the startup process.
> *
> * XXX: the caller actually has a list of XIDs that just committed. We
> * could save some clog lookups by taking advantage of that list.
> */
> oldest_running_primary_xid = procArray->oldest_running_primary_xid;
> while (oldest_running_primary_xid < max_xid)
> {
> if (!TransactionIdDidCommit(oldest_running_primary_xid) &&
> !TransactionIdDidAbort(oldest_running_primary_xid))
> {
> break;
> }
> TransactionIdAdvance(oldest_running_primary_xid);
> }
> if (max_xid == oldest_running_primary_xid)
> TransactionIdAdvance(oldest_running_primary_xid);

The point is to maintain an "oldest xmin" value based on the WAL records
that are being replayed. Whenever the currently oldest running XID
finishes, we scan the CLOG to find the next oldest XID that hasn't
completed yet.

That adds approximately one or two CLOG lookup to every commit record
replay on average. I haven't tried measuring that, but it seems like it
could slow down recovery. There are ways that could be improved. For
example, do it in larger batches.

A bunch of other small XXX comments remain, but they're just markers for
comments that need to be adjusted, or for further cleanups that are now
possible.

There are also several ways the inprogress cache could be made more
efficient, which I haven't explored:

- For each XID in the cache radix tree, we store one bit to indicate
whether the lookup has been performed, i.e. if the cache is valid for
the XID, and another bit to indicate if the XID is visible or not. With
64-bit cache words stored in the radix tree, each cache word can store
the status of 32 transactions. It would probably be better to work in
bigger chunks. For example, when doing a lookup in the cache, check the
status of 64 transactions at once. Assuming they're all stored on the
same CSN page, it would not be much more expensive than a single XID
lookup. That would make the cache 2x more compact, and save on future
lookups of XIDS falling on the same cache word.

- Initializing the radix tree cache is fairly expensive, with several
memory allocations. Many of those allocations could be done lazily with
some effort in radixtree.h.

- Or start the cache as a small array of XIDs, and switch to the radix
tree only after it fills up.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
v6-0001-Split-SnapshotData-into-separate-structs-for-each.patch text/x-patch 85.4 KB
v6-0002-Simplify-historic-snapshot-refcounting.patch text/x-patch 13.2 KB
v6-0003-Add-an-explicit-valid-flag-to-MVCCSnapshotData.patch text/x-patch 4.0 KB
v6-0004-Replace-static-snapshot-pointers-with-the-valid-f.patch text/x-patch 13.9 KB
v6-0005-Make-RestoreSnapshot-register-the-snapshot-with-c.patch text/x-patch 3.1 KB
v6-0006-Replace-the-RegisteredSnapshot-pairing-heap-with-.patch text/x-patch 21.6 KB
v6-0007-Split-MVCCSnapshot-into-inner-and-outer-parts.patch text/x-patch 94.6 KB
v6-0008-XXX-add-perf-test.patch text/x-patch 12.0 KB
v6-0009-Use-CSN-snapshots-during-Hot-Standby.patch text/x-patch 129.5 KB
v6-0010-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch text/x-patch 6.2 KB
v6-0011-Remove-the-now-unused-xids-array-from-xl_running_.patch text/x-patch 6.9 KB
v6-0012-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch text/x-patch 7.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Christoph Berg 2025-03-31 21:54:30 Re: pgsql: Add support for OAUTHBEARER SASL mechanism
Previous Message Andrew Jackson 2025-03-31 21:26:27 Re: [PATCH] PGSERVICEFILE as part of a normal connection string