Re: Snapshot related assert failure on skink

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Snapshot related assert failure on skink
Date: 2025-03-19 07:17:23
Message-ID: 605d6217-1050-43c8-83f5-7c52598c54cc@iki.fi
Lists: pgsql-hackers

On 19/03/2025 04:22, Tomas Vondra wrote:
> I kept stress-testing this, and while the frequency massively increased
> on PG18, I managed to reproduce this all the way back to PG14. I see
> ~100x more corefiles on PG18.
>
> That is not a proof the issue was introduced in PG14, maybe it's just
> the assert that was added there or something. Or maybe there's another
> bug in PG18, making the impact worse.
>
> But I'd suspect this is a bug in
>
> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
> Author: Andres Freund <andres(at)anarazel(dot)de>
> Date: Mon Aug 17 21:07:10 2020 -0700
>
> snapshot scalability: cache snapshots using a xact completion counter.
>
> Previous commits made it faster/more scalable to compute snapshots. But
> not building a snapshot is still faster. Now that GetSnapshotData() does
> not maintain RecentGlobal* anymore, that is actually not too hard:
>
> ...

Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
This can happen during hot standby:

1. Backend acquires snapshot A with xmin 1000
2. Startup process calls ExpireOldKnownAssignedTransactionIds().
3. Backend acquires snapshot B with xmin 1050
4. Backend releases snapshot A, updating TransactionXmin to 1050
5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
snapshot A's data.

Because xactCompletionCount is not updated in step 2, the
GetSnapshotDataReuse() call in step 5 will reuse snapshot A. But snapshot
A's xmin (1000) is lower than the TransactionXmin (1050) set in step 4.
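
To make the sequence above concrete, here is a toy model in C. It is only a
sketch, not PostgreSQL code: ToyShared, ToySnapshot and all toy_* functions
are hypothetical names invented for this illustration, and the model assumes
that snapshots A and B live in separate snapshot structs, each remembering
the completion counter it was built under. The bump_counter flag stands in
for the suggested change of updating xactCompletionCount when old
KnownAssignedXids entries are expired.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for shared state; not the real PostgreSQL structs. */
typedef struct {
    uint64_t xact_completion_count; /* models xactCompletionCount */
    uint32_t oldest_running_xid;    /* what a freshly computed xmin would be */
} ToyShared;

/* Hypothetical snapshot: remembers the counter value it was built under. */
typedef struct {
    uint32_t xmin;
    uint64_t snap_completion_count; /* 0 means "never built" */
} ToySnapshot;

/* Models the reuse test: reuse only if the counter has not moved. */
static bool toy_can_reuse(const ToyShared *shared, const ToySnapshot *snap)
{
    return snap->snap_completion_count != 0 &&
           snap->snap_completion_count == shared->xact_completion_count;
}

/* Models taking a snapshot: reuse cached contents or recompute. */
static void toy_get_snapshot(const ToyShared *shared, ToySnapshot *snap)
{
    if (toy_can_reuse(shared, snap))
        return; /* cached xmin is kept as-is */
    snap->xmin = shared->oldest_running_xid;
    snap->snap_completion_count = shared->xact_completion_count;
}

/* Models the startup process expiring old known-assigned XIDs. */
static void toy_expire_old(ToyShared *shared, uint32_t new_oldest,
                           bool bump_counter)
{
    shared->oldest_running_xid = new_oldest;
    if (bump_counter)
        shared->xact_completion_count++; /* the suggested change */
}

/*
 * Replays steps 1-5. Returns true if the snapshot handed out in step 5
 * has an xmin older than the backend's TransactionXmin.
 */
static bool toy_scenario(bool bump_counter)
{
    ToyShared shared = { .xact_completion_count = 1,
                         .oldest_running_xid = 1000 };
    ToySnapshot a = {0};
    ToySnapshot b = {0};

    toy_get_snapshot(&shared, &a);              /* 1: A, xmin 1000 */
    toy_expire_old(&shared, 1050, bump_counter); /* 2: expire old XIDs */
    toy_get_snapshot(&shared, &b);              /* 3: B, xmin 1050 */
    uint32_t transaction_xmin = b.xmin;         /* 4: A released */
    toy_get_snapshot(&shared, &a);              /* 5: may reuse A's data */

    return a.xmin < transaction_xmin;
}
```

In this model, toy_scenario(false) reports the stale reuse described above,
while toy_scenario(true) forces snapshot A to be recomputed in step 5.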

--
Heikki Linnakangas
Neon (https://neon.tech)
