From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Snapshot related assert failure on skink |
Date: | 2025-03-19 12:27:31 |
Message-ID: | c72e360d-b363-4cd7-a299-4ee41b193d94@vondra.me |
Lists: | pgsql-hackers |
On 3/19/25 08:17, Heikki Linnakangas wrote:
> On 19/03/2025 04:22, Tomas Vondra wrote:
>> I kept stress-testing this, and while the frequency massively increased
>> on PG18, I managed to reproduce this all the way back to PG14. I see
>> ~100x more corefiles on PG18.
>>
>> That is not a proof the issue was introduced in PG14, maybe it's just
>> the assert that was added there or something. Or maybe there's another
>> bug in PG18, making the impact worse.
>>
>> But I'd suspect this is a bug in
>>
>> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
>> Author: Andres Freund <andres(at)anarazel(dot)de>
>> Date: Mon Aug 17 21:07:10 2020 -0700
>>
>> snapshot scalability: cache snapshots using a xact completion counter.
>>
>> Previous commits made it faster/more scalable to compute snapshots. But
>> not building a snapshot is still faster. Now that GetSnapshotData() does
>> not maintain RecentGlobal* anymore, that is actually not too hard:
>>
>> ...
>
> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
> This can happen during hot standby:
>
> 1. Backend acquires snapshot A with xmin 1000
> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
> 3. Backend acquires snapshot B with xmin 1050
> 4. Backend releases snapshot A, updating TransactionXmin to 1050
> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
> snapshot A's data.
>
> Because xactCompletionCount is not updated in step 2, the
> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
> has a lower xmin.
>
Could be. As an experiment, I added an xactCompletionCount advance to the
two functions you mentioned and reran the stress test. I haven't seen any
failures so far, after ~1000 runs. Without the patch, the same test
produced ~200 failures/core files.
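
To make the scenario above concrete, here is a toy model in C of the
counter-based reuse check. It is only a sketch: ToyShared, ToySnapshot and
the toy_* functions are hypothetical names invented for illustration, not
PostgreSQL's real structures or the actual patch. The "bump" flag stands in
for the experiment of advancing xactCompletionCount when old known-assigned
xids are expired.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for shared state; NOT PostgreSQL's real layout. */
typedef struct {
    uint64_t xactCompletionCount; /* meant to change whenever snapshots would differ */
    uint32_t oldestRunningXid;    /* xmin a freshly built snapshot would get */
} ToyShared;

/* Toy stand-in for a statically allocated snapshot. */
typedef struct {
    bool     valid;
    uint32_t xmin;
    uint64_t snapXactCompletionCount; /* counter value seen when it was built */
} ToySnapshot;

/* Reuse the snapshot if the counter is unchanged, else rebuild it.
 * Returns true when the old contents were reused. */
static bool toy_get_snapshot(const ToyShared *sh, ToySnapshot *snap)
{
    if (snap->valid && snap->snapXactCompletionCount == sh->xactCompletionCount)
        return true;

    snap->valid = true;
    snap->xmin = sh->oldestRunningXid;
    snap->snapXactCompletionCount = sh->xactCompletionCount;
    return false;
}

/* Models the startup process expiring old known-assigned xids.
 * "bump" selects whether the completion counter is advanced too. */
static void toy_expire_old_xids(ToyShared *sh, uint32_t new_oldest, bool bump)
{
    sh->oldestRunningXid = new_oldest;
    if (bump)
        sh->xactCompletionCount++;
}

/* Replays steps 1-5 from the quoted scenario. Returns true if the final
 * snapshot ends up with an xmin older than the toy TransactionXmin. */
static bool toy_scenario_hits_stale_xmin(bool bump)
{
    ToyShared sh = { .xactCompletionCount = 1, .oldestRunningXid = 1000 };
    ToySnapshot a = {0}, b = {0};
    uint32_t transaction_xmin;

    (void) toy_get_snapshot(&sh, &a);     /* 1. snapshot A, xmin 1000 */
    toy_expire_old_xids(&sh, 1050, bump); /* 2. startup process expires xids */
    (void) toy_get_snapshot(&sh, &b);     /* 3. snapshot B, xmin 1050 */
    transaction_xmin = b.xmin;            /* 4. A released, xmin advances */
    (void) toy_get_snapshot(&sh, &a);     /* 5. tries to reuse A's data */

    return a.xmin < transaction_xmin;
}
```

In this model, toy_scenario_hits_stale_xmin(false) reuses snapshot A with
its old xmin, while toy_scenario_hits_stale_xmin(true) forces a rebuild, so
the stale xmin goes away once the counter is advanced.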

regards

--
Tomas Vondra