Re: CSN snapshots in hot standby

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-10-29 16:33:53
Message-ID: b1955fa6-14b3-46cb-8796-6a2373b7220d@iki.fi
Lists: pgsql-hackers

On 21/10/2024 20:32, Heikki Linnakangas wrote:
> On 24/09/2024 21:08, Andres Freund wrote:
>> I'd like to see some numbers for a workload with many overlapping
>> top-level
>> transactions. In contrast to 2), HEAD wouldn't need to do subtrans lookups,
>> whereas this patch would need to do csn lookups. And a four entry cache
>> probably wouldn't help very much.
>
> I spent some more time on the tests. Here is a better set of adversarial
> tests, which hit the worst case scenarios for this patch.
>
> All the test scenarios have this high-level shape:
>
> 1. Create a table with 100000 rows, vacuum freeze it.
>
> 2. In primary, open transactions or subtransactions, and DELETE all rows
> using the different (sub)transactions, to set the xmax of every row on
> the test table. Leave the transactions open.
>
> 3. In standby, SELECT COUNT(*) all rows in the table, and measure how
> long it takes.
>
> The difference between the test scenarios is in the pattern of xmax
> values, i.e. how many transactions or subtransactions were used. All the
> rows are visible, the performance differences come just from how
> expensive the visibility checks are in different cases.
>
> First, the results on 'master' without patches (smaller is better):
>
> few-xacts:                 0.0041 s / iteration
> many-xacts:                0.0042 s / iteration
> many-xacts-wide-apart:     0.0042 s / iteration
> few-subxacts:              0.0042 s / iteration
> many-subxacts:             0.0073 s / iteration
> many-subxacts-wide-apart:  0.10   s / iteration
>
> So even on master, there are significant differences depending on
> whether the sub-XIDs fit in the in-memory caches, or if you need to do
> lookups in pg_subtrans. That's not surprising. Note how bad the
> "many-subxacts-wide-apart" scenario is, though. It's over 20x slower
> than the best case scenario! I was a little taken aback by that. More on
> that later.
>
> Descriptions of the test scenarios:
>
> few-xacts: The xmax values on the rows cycle through four different
> XIDs, like this: 1001, 1002, 1003, 1004, 1001, 1002, 1003, 1004, ...
>
> many-xacts: like 'few-xacts', but cycle through 100 different XIDs.
>
> many-xacts-wide-apart: like 'many-xacts', but the XIDs used are spread
> out, so that there are 1000 unrelated committed XIDs in between each XID
> used in the test table. I.e. "1000, 2000, 3000, 4000, 5000, ...". It
> doesn't make a difference in the 'many-xacts-wide-apart' test, but in
> the many-subxacts-wide-apart variant it does. It makes the XIDs fall on
> different SLRU pages so that there are not enough SLRU buffers to hold
> them all.
>
> few-subxacts, many-subxacts, many-subxacts-wide-apart: Same tests, but
> instead of using different top-level XIDs, all the XIDs are
> subtransactions belonging to a single top-level XID.
>
>
> Now, with the patch (the unpatched numbers are repeated here for
> comparison):
>
>                            master     patched
> few-xacts:                 0.0041      0.0040 s / iteration
> many-xacts:                0.0042      0.0053 s / iteration
> many-xacts-wide-apart:     0.0042      0.17   s / iteration
> few-subxacts:              0.0042      0.0040 s / iteration
> many-subxacts:             0.0073      0.0052 s / iteration
> many-subxacts-wide-apart:  0.10        0.22   s / iteration
>
>
> So when the 4-element cache is effective, in the 'few-xacts' case, the
> patch performs well. In the 'many-xacts' case, it needs to perform CSN
> lookups, making it a little slower. The 'many-xacts-wide-apart' case
> regresses badly, showing the same SLRU thrashing effect on CSN lookups as
> the 'many-subxacts-wide-apart' case does on 'master' on pg_subtrans
> lookups.

Here's another version, which replaces the small 4-element cache with a
cache with no size limit. It's implemented as a radix tree, and entries
are never removed, so at most it can grow to hold the status of all
XIDs between the snapshot's xmin and xmax.
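To illustrate the packing (just a sketch with made-up names, not the
code in the attached patch): each radix tree key covers 32 XIDs, and
each 64-bit leaf word holds 2 status bits per XID, roughly like this:

/*
 * Sketch of the 2-bits-per-XID packing (illustrative names only).
 * Each 64-bit word covers 32 consecutive XIDs; the radix tree is
 * keyed by xid / 32 and stores one such word per key.
 */
#include <stdint.h>

typedef uint32_t TransactionId;     /* 32-bit XID, as in PostgreSQL */

#define XIDS_PER_WORD       32
#define STATUS_BITS_PER_XID 2
#define STATUS_MASK         UINT64_C(0x3)

typedef enum
{
    XID_STATUS_UNKNOWN   = 0,       /* not cached yet */
    XID_STATUS_VISIBLE   = 1,       /* visible to this snapshot */
    XID_STATUS_INVISIBLE = 2,       /* not visible to this snapshot */
} CachedXidStatus;

/* radix tree key of the word that holds this XID's status */
static inline uint64_t
cache_key(TransactionId xid)
{
    return (uint64_t) xid / XIDS_PER_WORD;
}

/* extract the cached status of 'xid' from its 64-bit word */
static inline CachedXidStatus
cache_extract(uint64_t word, TransactionId xid)
{
    unsigned    shift = (xid % XIDS_PER_WORD) * STATUS_BITS_PER_XID;

    return (CachedXidStatus) ((word >> shift) & STATUS_MASK);
}

/* return 'word' with the status of 'xid' set to 'status' */
static inline uint64_t
cache_store(uint64_t word, TransactionId xid, CachedXidStatus status)
{
    unsigned    shift = (xid % XIDS_PER_WORD) * STATUS_BITS_PER_XID;

    word &= ~(STATUS_MASK << shift);
    word |= (uint64_t) status << shift;
    return word;
}

One leaf word thus answers 32 visibility checks, which is why the
per-XID memory cost stays low.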

This new cache solves the performance issue with the earlier tests:

                           master     patched
few-xacts:                 0.0041     0.0041  s / iteration
many-xacts:                0.0042     0.0042  s / iteration
many-xacts-wide-apart:     0.0043     0.0045  s / iteration
few-subxacts:              0.0043     0.0042  s / iteration
many-subxacts:             0.0076     0.0042  s / iteration
many-subxacts-wide-apart:  0.11       0.0070  s / iteration

The new cache also elides the slow pg_subtrans lookups that make the
many-subxacts-wide-apart case slow on 'master', which is nice.
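
To spell out why pg_subtrans drops out of the picture (again just a
sketch with hypothetical helper names, building on the packing sketch
above): the cache is keyed by the XID itself, so a subtransaction XID
is answered directly on a cache hit, without first mapping it to its
top-level XID:

/*
 * Cache-first visibility check (sketch; names are hypothetical and the
 * patch's actual code differs).  Uses CachedXidStatus and TransactionId
 * from the packing sketch above.
 */
typedef struct SnapshotCache SnapshotCache;   /* the per-snapshot radix tree */

/* hypothetical helpers, declared only so the sketch reads as C: */
extern CachedXidStatus cache_lookup(SnapshotCache *cache, TransactionId xid);
extern void cache_insert(SnapshotCache *cache, TransactionId xid,
                         CachedXidStatus status);
extern bool csn_visible(TransactionId xid, uint64_t snapshot_csn);

static bool
xid_visible_cached(SnapshotCache *cache, uint64_t snapshot_csn,
                   TransactionId xid)
{
    CachedXidStatus status = cache_lookup(cache, xid);

    if (status == XID_STATUS_UNKNOWN)
    {
        /* miss: do the CSN (SLRU) lookup once, then remember the answer */
        status = csn_visible(xid, snapshot_csn) ? XID_STATUS_VISIBLE
                                                : XID_STATUS_INVISIBLE;
        cache_insert(cache, xid, status);
    }
    return status == XID_STATUS_VISIBLE;
}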

I added two tests to the test suite:

                               master      patched
insert-all-different-xids:     0.00027     0.00019  s / iteration
insert-all-different-subxids:  0.00023     0.00020  s / iteration

insert-all-different-xids: Open 1000 connections, insert one row in
each, and leave the transactions open. In the replica, select all the
rows.

insert-all-different-subxids: The same, but with 1 transaction with 1000
subxids.

The point of these new tests is to exercise the scenario where the
cache doesn't help and just adds overhead, because each XID is looked
up only once. That seems to be fine; surprisingly good, actually. I'll
do some more profiling on that to understand why it's even faster than
'master'.

Now the downside of this new cache: since it has no size limit, if you
keep looking up different XIDs, it will keep growing until it holds all
the XIDs between the snapshot's xmin and xmax. That can take a lot of
memory in the worst case. The radix tree is pretty memory-efficient,
but holding, say, 1 billion XIDs would probably take something like
500 MB of RAM (the radix tree stores 64-bit words with 2 bits per XID,
plus the radix tree nodes). That's per snapshot, so if you have a lot
of connections, maybe even with multiple snapshots each, that can add
up.
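
To spell out that back-of-the-envelope estimate (my arithmetic, with
the same assumptions as above: 2 status bits per XID, packed into
64-bit words):

    1 billion XIDs * 2 bits/XID = 2 billion bits ~= 250 MB of packed status words
    + radix tree nodes, headers, and partially filled words
    => on the order of the ~500 MB mentioned above, per snapshot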

I'm inclined to accept that memory usage. If we wanted to limit the
size of the cache, we would need to choose a policy on how to truncate
it (delete random nodes?), what the limit should be, etc. But I think
it'd be rare to hit those cases in practice. If you have a transaction
that is one billion XIDs old running in the primary, you probably have
bigger problems already.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
0001-XXX-add-perf-test.patch text/x-patch 12.0 KB
0002-Use-CSN-snapshots-during-Hot-Standby.patch text/x-patch 128.8 KB
0003-Make-SnapBuildWaitSnapshot-work-without-xl_running_x.patch text/x-patch 6.2 KB
0004-Remove-the-now-unused-xids-array-from-xl_running_xac.patch text/x-patch 7.0 KB
0005-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-lookup.patch text/x-patch 5.7 KB
