Re: CSN snapshots in hot standby

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-10-21 17:32:33
Message-ID: 7cd97248-fd04-4fb2-9f7f-096c1d7660e8@iki.fi
Lists: pgsql-hackers

On 24/09/2024 21:08, Andres Freund wrote:
> I'd like to see some numbers for a workload with many overlapping top-level
> transactions. In contrast to 2), HEAD wouldn't need to do subtrans lookups,
> whereas this patch would need to do CSN lookups. And a four-entry cache
> probably wouldn't help very much.

I spent some more time on the tests. Here is a better set of adversarial
tests, which hit the worst-case scenarios for this patch.

All the test scenarios have this high-level shape:

1. Create a table with 100000 rows, vacuum freeze it.

2. On the primary, open transactions or subtransactions, and DELETE all
rows using the different (sub)transactions, to set the xmax of every row
in the test table. Leave the transactions open.

3. On the standby, SELECT COUNT(*) of all the rows in the table, and
measure how long it takes.

The difference between the test scenarios is in the pattern of xmax
values, i.e. how many transactions or subtransactions were used. All the
rows are visible; the performance differences come purely from how
expensive the visibility checks are in each case.

First, the results on 'master' without patches (smaller is better):

few-xacts:                 0.0041 s / iteration
many-xacts:                0.0042 s / iteration
many-xacts-wide-apart:     0.0042 s / iteration
few-subxacts:              0.0042 s / iteration
many-subxacts:             0.0073 s / iteration
many-subxacts-wide-apart:  0.10   s / iteration

So even on master, there are significant differences depending on
whether the sub-XIDs fit in the in-memory caches or you need to do
lookups in pg_subtrans. That's not surprising. Note how bad the
"many-subxacts-wide-apart" scenario is, though: it's over 20x slower
than the best-case scenario! I was a little taken aback by that. More on
that later.

Descriptions of the test scenarios:

few-xacts: The xmax values on the rows cycle through four different
XIDs, like this: 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...

many-xacts: like 'few-xacts', but cycle through 100 different XIDs.

many-xacts-wide-apart: like 'many-xacts', but the XIDs used are spread
out, so that there are 1000 unrelated committed XIDs in between each XID
used in the test table, i.e. "1000, 2000, 3000, 4000, 5000, ...". The
spreading makes no difference in this test itself, but it does in the
'many-subxacts-wide-apart' variant: it makes the XIDs fall on different
SLRU pages, so that there are not enough SLRU buffers to hold them all.

few-subxacts, many-subxacts, many-subxacts-wide-apart: Same tests, but
instead of using different top-level XIDs, all the XIDs are
subtransactions belonging to a single top-level XID.

Now, with the patch (the unpatched numbers are repeated here for
comparison):

                           master    patched
few-xacts:                 0.0041    0.0040   s / iteration
many-xacts:                0.0042    0.0053   s / iteration
many-xacts-wide-apart:     0.0042    0.17     s / iteration
few-subxacts:              0.0042    0.0040   s / iteration
many-subxacts:             0.0073    0.0052   s / iteration
many-subxacts-wide-apart:  0.10      0.22     s / iteration

So when the 4-element cache is effective, as in the 'few-xacts' case,
the patch performs well. In the 'many-xacts' case, it needs to perform
CSN lookups, making it a little slower. The 'many-xacts-wide-apart' case
regresses badly: it shows the same SLRU thrashing effect on CSN lookups
that the 'many-subxacts-wide-apart' case shows on 'master' with
pg_subtrans lookups.

Some thoughts on all this:

1. The many-subxacts-wide-apart performance is horrible even on master.
'perf' shows that about half of the CPU time is spent in open() and
close(). We open and close the SLRU file every time we need to read a
page! That's obviously silly, but also shouldn't be hard to fix.
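
To illustrate, the fix could be as simple as keeping the most recently
used segment's file descriptor open across page reads. A minimal sketch
of the idea, with hypothetical names, and with the segment naming, error
handling and locking all glossed over (this is not the actual slru.c
code):

#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192

/* Keep the last segment open instead of open()/close() on every read. */
static int  cached_fd = -1;
static int  cached_segno = -1;

static bool
read_slru_page(int segno, off_t offset, char *buf)
{
    if (cached_fd < 0 || segno != cached_segno)
    {
        char    path[64];

        if (cached_fd >= 0)
            close(cached_fd);
        snprintf(path, sizeof(path), "pg_csnlog/%04X", segno);
        cached_fd = open(path, O_RDONLY);
        if (cached_fd < 0)
            return false;
        cached_segno = segno;
    }
    /* pread() reads at an absolute offset, so one fd serves all pages */
    return pread(cached_fd, buf, PAGE_SIZE, offset) == PAGE_SIZE;
}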

2. Even if we fix the open/close issue and make the worst case 2x
faster, the worst case is still bad. We could call this a tuning issue,
since more SLRU buffers help, but that's not very satisfactory. I really
wish we could make SLRU buffers auto-tuning, or move them to the main
buffer pool, or something. And I wish SLRU lookups were faster even when
the SLRU page is already in memory. The LWLock acquire+release shows up
in profiles; maybe we could do some kind of optimistic locking instead.
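
For the locking part, one shape that optimistic locking could take is a
seqlock-style retry loop: writers bump a version counter before and
after modifying a page, and readers retry if they saw an odd or changed
counter. A sketch with hypothetical types, not existing PostgreSQL code:

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef struct
{
    /* even = no writer active; writers increment before and after */
    _Atomic uint64_t version;
    char        data[PAGE_SIZE];
} OptimisticPage;

static uint64_t
read_entry_optimistic(OptimisticPage *page, int offset)
{
    uint64_t    v_before, v_after, value;

    do
    {
        v_before = atomic_load_explicit(&page->version, memory_order_acquire);
        memcpy(&value, page->data + offset, sizeof(value));
        atomic_thread_fence(memory_order_acquire);
        v_after = atomic_load_explicit(&page->version, memory_order_relaxed);
    } while ((v_before & 1) != 0 || v_before != v_after);  /* retry on writer */

    return value;
}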

3. Aside from making SLRUs faster, we could also mask their slowness in
the CSN patch by caching. The 4-element cache in Snapshot that I
implemented is fast when it's sufficient, but we could make it larger to
cover more cases. At the extreme, we could never remove elements from it
at all, and just let it grow as large as needed.
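
As a sketch of what the "let it grow" extreme could look like, with
hypothetical types rather than the patch's actual data structures (a
real version would want a hash table instead of a linear scan once the
cache gets big):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

typedef struct
{
    TransactionId xid;
    CommitSeqNo csn;
} XidCsnEntry;

typedef struct
{
    XidCsnEntry *entries;
    int         nentries;
    int         capacity;
} XidCsnCache;

static bool
csn_cache_lookup(XidCsnCache *cache, TransactionId xid, CommitSeqNo *csn)
{
    /* scan newest first: recently cached XIDs are the likeliest hits */
    for (int i = cache->nentries - 1; i >= 0; i--)
    {
        if (cache->entries[i].xid == xid)
        {
            *csn = cache->entries[i].csn;
            return true;
        }
    }
    return false;
}

static void
csn_cache_insert(XidCsnCache *cache, TransactionId xid, CommitSeqNo csn)
{
    /* never evict; just double the array when it fills (OOM check omitted) */
    if (cache->nentries == cache->capacity)
    {
        cache->capacity = cache->capacity > 0 ? cache->capacity * 2 : 8;
        cache->entries = realloc(cache->entries,
                                 cache->capacity * sizeof(XidCsnEntry));
    }
    cache->entries[cache->nentries].xid = xid;
    cache->entries[cache->nentries].csn = csn;
    cache->nentries++;
}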

4. Currently on 'master', the XID list in a snapshot is an array of XIDs
that is binary searched. A different data structure might be better.
When the difference between xmin and xmax is small, a bitmap would be
compact and fast to look up, for example. Or maybe a radix tree or
something. This is an independent optimization that might make
XidInMVCCSnapshot() faster even without the CSN stuff, but if we decide
to go with a large cache (see the previous paragraph), it would be nice
to reduce the worst-case memory usage of the cache with something like
this.
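
To make the bitmap variant concrete, a membership test over [xmin, xmax)
could look roughly like this (hypothetical structure; real code would
need the wraparound-aware XID comparisons, and the bitmap costs
(xmax - xmin) / 8 bytes regardless of how many XIDs are in progress):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
    TransactionId xmin;     /* all XIDs < xmin are committed/aborted */
    TransactionId xmax;     /* all XIDs >= xmax are still invisible */
    uint8_t    *bits;       /* one bit per XID in [xmin, xmax) */
} SnapshotBitmap;

/* Returns true if the XID was in progress when the snapshot was taken. */
static bool
xid_in_snapshot(SnapshotBitmap *snap, TransactionId xid)
{
    uint32_t    i;

    if (xid < snap->xmin)
        return false;
    if (xid >= snap->xmax)
        return true;
    i = xid - snap->xmin;
    return (snap->bits[i / 8] >> (i % 8)) & 1;
}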

5. I'm not sure how much any of this matters in practice. Performance is
obviously important, but we don't get too many complaints about these
things even though the 'many-subxacts-wide-apart' case is pretty bad
already, and it was not that easy to construct these adversarial
scenarios. If we were implementing this from scratch, I think we could
easily accept the performance with the patch. Regressions can be very
unpleasant for existing users, however.

Thoughts? I think the fastest way to make progress with the CSN patch is
to make the cache larger, to hide the SLRU lookups. But those other
things would be interesting to explore too.

>> +/*
>> + * Record commit LSN of a transaction and its subtransaction tree.
>> + *
>> + * xid is a single xid to set status for. This will typically be the top level
>> + * transaction ID for a top level commit.
>> + *
>> + * subxids is an array of xids of length nsubxids, representing subtransactions
>> + * in the tree of xid. In various cases nsubxids may be zero.
>> + *
>> + * commitLsn is the LSN of the commit record. This is currently never called
>> + * for aborted transactions.
>> + */
>> +void
>> +CSNLogSetCSN(TransactionId xid, int nsubxids, TransactionId *subxids,
>> +             XLogRecPtr commitLsn)
>> +{
>> +    int         pageno;
>> +    int         i = 0;
>> +    int         offset = 0;
>> +
>> +    Assert(TransactionIdIsValid(xid));
>> +
>> +    pageno = TransactionIdToPage(xid);  /* get page of parent */
>> +    for (;;)
>> +    {
>> +        int         num_on_page = 0;
>> +
>> +        while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
>> +        {
>> +            num_on_page++;
>> +            i++;
>> +        }
>
>
> Hm - is there any guarantee / documented requirement that subxids is sorted?

Yes, subtransaction XIDs are assigned in order. But point taken; I'll
add a comment about that here too.
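
For example, something along these lines next to the loop (just a
sketch; TransactionIdFollows() is the existing wraparound-aware
comparison):

    /*
     * Subtransaction XIDs are assigned in ascending XID order, so the
     * subxids array is always sorted.
     */
    Assert(i == 0 || TransactionIdFollows(subxids[i], subxids[i - 1]));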

>
>> +        CSNLogSetPageStatus(xid,
>> +                            num_on_page, subxids + offset,
>> +                            commitLsn, pageno);
>> +        if (i >= nsubxids)
>> +            break;
>> +
>> +        offset = i;
>> +        pageno = TransactionIdToPage(subxids[offset]);
>> +        xid = InvalidTransactionId;
>> +    }
>> +}
>
>
> Hm. Maybe I'm missing something, but what prevents a concurrent transaction
> from checking the visibility of a subtransaction between marking the
> subtransaction committed and marking the main transaction committed? If the
> subtransaction and main transaction are on the same page that won't be
> possible, but if they are on different ones it does seem possible?
>
> Today XidInMVCCSnapshot() will use pg_subtrans to find the top transaction in
> case of a suboverflowed snapshot, but with this patch that's not the case
> anymore. Which afaict will mean that repeated snapshot computations could
> give different results for the same query?

The concurrent transaction's snapshot would consider the transaction and
all its subtransactions as still in-progress. Depending on the timing,
when XidInMVCCSnapshot() looks up the CSN of the subtransaction XID, it
will either see that it has no CSN, which means it's still in progress
and thus invisible, or it will see a CSN greater than the snapshot's
CSN, which makes it invisible for that reason. The global
latestCommitLSN value is advanced only after CSNLogSetCSN() has finished
setting the CSN on all the subtransactions, so before that, there cannot
be any snapshot that would see it as visible.
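
In pseudocode, the visibility decision described above looks like this
(hypothetical names; csnlog_get_csn() stands in for the real CSN log
lookup):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* stand-in for the real CSN log lookup */
extern XLogRecPtr csnlog_get_csn(TransactionId xid);

static bool
xid_visible_in_snapshot(TransactionId xid, XLogRecPtr snapshot_csn)
{
    XLogRecPtr  xid_csn = csnlog_get_csn(xid);

    if (xid_csn == InvalidXLogRecPtr)
        return false;       /* no CSN assigned yet: still in progress */
    if (xid_csn > snapshot_csn)
        return false;       /* committed after the snapshot was taken */
    return true;            /* committed before the snapshot was taken */
}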

No code changes since v2, except the tests and minor comments, but I'm
including all the patches here again for convenience.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
v3-0001-XXX-add-perf-test.patch text/x-patch 10.4 KB
v3-0002-Use-CSN-snapshots-during-Hot-Standby.patch text/x-patch 128.8 KB
v3-0003-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch text/x-patch 6.2 KB
v3-0004-Remove-the-now-unused-xids-array-from-xl_running_.patch text/x-patch 7.0 KB
v3-0005-Add-a-small-cache-to-Snapshot-to-avoid-CSN-lookup.patch text/x-patch 2.5 KB
