Re: CSN snapshots in hot standby

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-12-03 14:25:13
Message-ID: 718d1788-b058-40e6-bc37-8f15612b5646@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 20/11/2024 15:33, John Naylor wrote:
> The tree is only as tall as need be to store the highest non-zero
> byte. On a newly initialized cluster, the current txid is small. The
> first two test cases here will result in a tree with height of 2. The
> last one will have a height of 3, and its runtime looks a bit higher,
> although that could be just noise or touching more cache lines. It
> might be worth it to try a test run while forcing the upper byte of
> the keys to be non-zero (something like "key | (1<<30), so that the
> tree always has a height of 4. That would match real-world conditions
> more closely. If need be, there are a couple things we can do to
> optimize node dispatch and touch fewer cache lines.

Good point. In some quick testing with the 'few-xacts' test, the
difference between the worst case and best case is about 10%. That can
be avoided very easily: instead of using the xid as the key, use (xmax -
xid). That way, the highest key used is the distance between the
snapshot's xmin and xmax rather than the absolute xid values.

>> I added two tests to the test suite:
>> master patched
>> insert-all-different-xids: 0.00027 0.00019 s / iteration
>> insert-all-different-subxids: 0.00023 0.00020 s / iteration
>
>> The point of these new tests is to test the scenario where the cache
>> doesn't help and just adds overhead, because each XID is looked up only
>> once. Seems to be fine. Surprisingly good actually; I'll do some more
>> profiling on that to understand why it's even faster than 'master'.
>
> These tests use a sequential scan. For things like primary key
> lookups, I wonder if the overhead of creating and destroying the
> tree's memory contexts for the (not used again) cache would be
> noticeable. If so, it wouldn't be too difficult to teach radix tree to
> create the larger contexts lazily.

I played with that a little, but it doesn't make much difference. It is
measurable when you look at the profile with 'perf': XidInMVCCSnapshot()
takes about 4% of CPU time in the worst case, where you perform a very
simple SeqScan over a table with just one row, and you perform only one
visibility check in each scan. I could squeeze that down to around 3% by
allocating the contexts lazily. But that's not significant in the grand
scheme of things. Might still be worth doing, but it's not a blocker for
this work, it can be discussed separately.

I did find one weird thing that makes a big difference: I originally
used AllocSetContextCreate(..., ALLOCSET_DEFAULT_SIZES) for the radix
tree's memory context. With that, XidInMVCCSnapshot() takes about 19% of
the CPU time in that test. When I changed that to ALLOCSET_SMALL_SIZES,
it falls down to the 4% figure. And weird enough, in both cases the time
seems to be spent in the malloc() call from SlabContextCreate(), not
AllocSetContextCreate(). I think doing this particular mix of large and
small allocations with malloc() somehow poisons its free list or
something. So this is probably heavily dependent on the malloc()
implementation. In any case, ALLOCSET_SMALL_SIZES is clearly a better
choice here, even without that effect.

One way to eliminate all that would be to start with a tiny cache with
e.g. 4 elements with linear probing, and switch to the radix tree only
when that fills up. But it doesn't seem worth the extra code right now.

Here's a new version with the those small changes:
- use xmax - xid as the cache key
- use ALLOCSET_SMALL_SIZES for the radix tree's memory context

Thanks for looking into this!

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
v5-0001-XXX-add-perf-test.patch text/x-patch 12.0 KB
v5-0002-Use-CSN-snapshots-during-Hot-Standby.patch text/x-patch 128.8 KB
v5-0003-Make-SnapBuildWaitSnapshot-work-without-xl_runnin.patch text/x-patch 6.2 KB
v5-0004-Remove-the-now-unused-xids-array-from-xl_running_.patch text/x-patch 7.0 KB
v5-0005-Add-a-cache-to-Snapshot-to-avoid-repeated-CSN-loo.patch text/x-patch 6.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-12-03 14:31:19 Re: Changing shared_buffers without restart
Previous Message jian he 2024-12-03 14:15:44 Re: Virtual generated columns