Re: CSN snapshots in hot standby

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-11-15 19:16:13
Message-ID: 39f9fe48-4db9-4daa-b4c5-c6f46ac92597@iki.fi

On 29/10/2024 18:33, Heikki Linnakangas wrote:
> I added two tests to the test suite:
>                                 master     patched
> insert-all-different-xids:     0.00027    0.00019 s / iteration
> insert-all-different-subxids:  0.00023    0.00020 s / iteration
>
> insert-all-different-xids: Open 1000 connections, insert one row in
> each, and leave the transactions open. In the replica, select all the rows.
>
> insert-all-different-subxids: The same, but with 1 transaction with 1000
> subxids.
>
> The point of these new tests is to test the scenario where the cache
> doesn't help and just adds overhead, because each XID is looked up only
> once. Seems to be fine. Surprisingly good actually; I'll do some more
> profiling on that to understand why it's even faster than 'master'.

Ok, I did some profiling and it makes sense:

In the insert-all-different-xids test on 'master', we spend about 60% of
CPU time in XidInMVCCSnapshot(), doing pg_lfind32() over the subxip
array. We should probably sort the array and use a binary search if it's
large or something...
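
For concreteness, here's a standalone sketch of that idea -- keep the
subxip array sorted when the snapshot is taken, then probe it with
bsearch(). The names are illustrative only, this isn't a proposed patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Plain uint32 ordering; ignores xid wraparound for clarity.  Within
 * a single snapshot's subxip array that's typically safe, but real
 * code would need to think about it. */
static int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    return (xa > xb) - (xa < xb);
}

/* O(n log n), paid once when the snapshot is built. */
static void
sort_subxip(TransactionId *subxip, int subxcnt)
{
    qsort(subxip, subxcnt, sizeof(TransactionId), xid_cmp);
}

/* O(log n) per lookup, vs. pg_lfind32()'s O(n) scan. */
static bool
subxip_contains(TransactionId xid, const TransactionId *subxip, int subxcnt)
{
    return bsearch(&xid, subxip, subxcnt, sizeof(TransactionId),
                   xid_cmp) != NULL;
}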

With these patches, instead of the pg_lfind32() over the subxip array,
we perform a single CSN SLRU lookup, and the page is cached. There's
locking overhead etc. with that, but it's still cheaper than the
pg_lfind32().
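
Conceptually, the patched check boils down to something like the sketch
below. Everything here (csnlog_get_csn, the snapshotcsn field, the
struct) is a stand-in name, not the patch's actual API, and
wraparound-aware xid comparisons are omitted:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

#define InvalidCommitSeqNo ((CommitSeqNo) 0)

typedef struct
{
    TransactionId xmin;         /* all xids < xmin are already finished */
    TransactionId xmax;         /* all xids >= xmax are still in progress */
    CommitSeqNo   snapshotcsn;  /* commits up to here are visible */
} CsnSnapshot;

/* Stand-in for the pg_csnlog SLRU read: returns the commit CSN of
 * xid, or InvalidCommitSeqNo if it hasn't committed. */
extern CommitSeqNo csnlog_get_csn(TransactionId xid);

/* Mirrors XidInMVCCSnapshot()'s contract: true if xid is still
 * "in progress" as far as this snapshot is concerned. */
static bool
xid_in_snapshot(TransactionId xid, const CsnSnapshot *snap)
{
    if (xid < snap->xmin)
        return false;
    if (xid >= snap->xmax)
        return true;

    /* The single SLRU lookup that replaces the pg_lfind32() scan. */
    CommitSeqNo csn = csnlog_get_csn(xid);

    return csn == InvalidCommitSeqNo || csn > snap->snapshotcsn;
}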

In the insert-all-different-subxids test on 'master', the subxip array
has overflowed, so we call SubTransGetTopmostTransaction() on each XID.
That performs two pg_subtrans lookups per XID: first for the subxid,
then for its parent. With these patches, we perform just one SLRU
lookup, in pg_csnlog, which is faster.
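
For comparison, the overflowed path on master is essentially this loop
(simplified from SubTransGetTopmostTransaction() in subtrans.c; the
TransactionXmin cutoff and sanity checks are dropped):

/* Walk pg_subtrans upward until the parent is invalid.  Each
 * SubTransGetParent() call is one SLRU lookup, so a direct child of
 * a top-level transaction costs two: subxid -> parent, then
 * parent -> invalid. */
TransactionId
topmost_xid_sketch(TransactionId xid)
{
    TransactionId parent = xid;
    TransactionId previous = xid;

    while (TransactionIdIsValid(parent))
    {
        previous = parent;
        parent = SubTransGetParent(parent);  /* one pg_subtrans lookup */
    }
    return previous;
}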

> Now the downside of this new cache: Since it has no size limit, if you
> keep looking up different XIDs, it will keep growing until it holds all
> the XIDs between the snapshot's xmin and xmax. That can take a lot of
> memory in the worst case. Radix tree is pretty memory efficient, but
> holding, say 1 billion XIDs would probably take something like 500 MB of
> RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
> radix tree nodes). That's per snapshot, so if you have a lot of
> connections, maybe even with multiple snapshots each, that can add up.
>
> I'm inclined to accept that memory usage. If we wanted to limit the size
> of the cache, we would need to choose a policy on how to truncate it
> (delete random nodes?), what the limit should be etc. But I think it'd
> be rare to hit those cases in practice. If you have a one billion XID
> old transaction running in the primary, you probably have bigger
> problems already.
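
To spell out the arithmetic behind that estimate:

    10^9 XIDs * 2 bits/XID = 2 * 10^9 bits = 250 MB of packed leaf words

and if the interior radix tree nodes cost roughly as much again, you
end up around the 500 MB mark. The node overhead is the hand-wavy part.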

I'd love to hear some thoughts on this caching behavior. Is it
acceptable to let the cache grow, potentially to very large sizes in the
worst cases? Or do we need to make it more complicated and implement
some eviction policy?

--
Heikki Linnakangas
Neon (https://neon.tech)
