Re: CSN snapshots in hot standby

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CSN snapshots in hot standby
Date: 2024-11-20 13:33:01
Message-ID: CANWCAZaxwGfZxD1KQb88_eLUPj_yx7Sc3B1y6-kjVN2g7wy1iw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2024 at 11:34 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> master patched
> few-xacts: 0.0041 0.0041 s / iteration
> many-xacts: 0.0042 0.0042 s / iteration
> many-xacts-wide-apart: 0.0043 0.0045 s / iteration

Hi Heikki,

I have some thoughts about the behavior of the cache that might not be
apparent in this test:

The tree is only as tall as need be to store the highest non-zero
byte. On a newly initialized cluster, the current txid is small. The
first two test cases here will result in a tree with height of 2. The
last one will have a height of 3, and its runtime looks a bit higher,
although that could be just noise or touching more cache lines. It
might be worth it to try a test run while forcing the upper byte of
the keys to be non-zero (something like "key | (1 << 30)"), so that
the tree always has a height of 4. That would match real-world conditions
more closely. If need be, there are a couple things we can do to
optimize node dispatch and touch fewer cache lines.

> I added two tests to the test suite:
> master patched
> insert-all-different-xids: 0.00027 0.00019 s / iteration
> insert-all-different-subxids: 0.00023 0.00020 s / iteration

> The point of these new tests is to test the scenario where the cache
> doesn't help and just adds overhead, because each XID is looked up only
> once. Seems to be fine. Surprisingly good actually; I'll do some more
> profiling on that to understand why it's even faster than 'master'.

These tests use a sequential scan. For things like primary key
lookups, I wonder if the overhead of creating and destroying the
tree's memory contexts for a cache that is never used again would be
noticeable. If so, it wouldn't be too difficult to teach radix tree to
create the larger contexts lazily.

> Now the downside of this new cache: Since it has no size limit, if you
> keep looking up different XIDs, it will keep growing until it holds all
> the XIDs between the snapshot's xmin and xmax. That can take a lot of
> memory in the worst case. Radix tree is pretty memory efficient, but
> holding, say 1 billion XIDs would probably take something like 500 MB of
> RAM (the radix tree stores 64-bit words with 2 bits per XID, plus the
> radix tree nodes). That's per snapshot, so if you have a lot of
> connections, maybe even with multiple snapshots each, that can add up.
>
> I'm inclined to accept that memory usage. If we wanted to limit the size
> of the cache, would need to choose a policy on how to truncate it
> (delete random nodes?), what the limit should be etc. But I think it'd
> be rare to hit those cases in practice. If you have a one billion XID
> old transaction running in the primary, you probably have bigger
> problems already.

I don't have a good sense of whether it needs a limit or not, but if
we decide to add one as a precaution, maybe it's enough to just blow
the cache away when reaching some limit? Being smarter than that would
need some work.

--
John Naylor
Amazon Web Services
