Re: Proposal to add page headers to SLRU pages

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Cc: "Li, Yong" <yoli(at)ebay(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Bagga, Rishu" <bagrishu(at)amazon(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Debnath, Shawn" <sdn(at)ebay(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, "Shyrabokau, Anton" <antons(at)ebay(dot)com>
Subject: Re: Proposal to add page headers to SLRU pages
Date: 2024-03-08 20:39:11
Message-ID: da8724ce71d2d7ef826a93ac61768ab46c8c1ca3.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 2024-03-08 at 13:58 +0100, Alvaro Herrera wrote:
> In the new code we effectively store only one LSN per page, which I
> understand is strictly worse.

To quote from the README:

"Instead, we defer the setting of the hint bit if we have not yet
flushed WAL as far as the LSN associated with the transaction. This
requires tracking the LSN of each unflushed async commit."

In other words, the problem the group_lsns are solving is that we can't
set the hint bit on a tuple until the commit record for that
transaction has actually been flushed. For ordinary sync commit, that's
fine, because the CLOG bit isn't set until after the commit record is
flushed. But for async commit, the CLOG may be updated before the WAL
is flushed, and group_lsns are one way to track the right information
to hold off updating the hint bits.

"It is convenient to associate this data with clog buffers: because we
will flush WAL before writing a clog page, we know that we do not need
to remember a transaction's LSN longer than the clog page holding its
commit status remains in memory."

It's not clear to me that it is so convenient, if it's preventing the
SLRU from fitting in with the rest of the system architecture.

"The worst case is where a sync-commit xact shares a cached LSN with an
async-commit xact that commits a bit later; even though we paid to sync
the first xact to disk, we won't be able to hint its outputs until the
second xact is sync'd, up to three walwriter cycles later."

Perhaps we can revisit alternatives to the group_lsn? If we accept
Yong's proposal, and the SLRU has a normal LSN and was used in the
normal way, we would just need some kind of secondary structure to hold
a mapping from XID->LSN only for async transactions.

The characteristics we are looking for in this secondary structure are:

1. cheap to test if it's empty, so it doesn't interfere with a purely
sync workload at all
2. expire old entries (where the LSN has already been flushed)
cheaply enough so the data structure doesn't bloat
3. look up an LSN given an XID cheaply enough that it doesn't
interfere with setting hint bits

Making a better secondary structure seems doable to me. Just to
brainstorm:

* Have an open-addressed hash table mapping async XIDs to their
commit LSN. If you have a hash collision, opportunistically see if the
entry is old and can be removed. Try K probes, and if they are all
recent, then you need to XLogFlush. The table could get pretty big,
because it needs to hold enough async transactions for a wal writer
cycle or two, but it seems reasonable to make async workloads pay that
memory cost.

* Another idea, if the size of the structure is a problem, is to
group K async xids into a bloom filter that points at a single LSN.
When more transactions come along, create another bloom filter for the
next K async xids. This could interfere with setting hint bits for sync
xids if the bloom filters are undersized, but that doesn't seem like a
big problem.

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-03-08 20:48:45 Re: Failures in constraints regression test, "read only 0 of 8192 bytes"
Previous Message Thomas Munro 2024-03-08 20:29:52 Re: Failures in constraints regression test, "read only 0 of 8192 bytes"