Re: Snapshot related assert failure on skink

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Snapshot related assert failure on skink
Date: 2025-03-21 10:28:23
Message-ID: 46e1ad87-9547-444c-838d-808834cf185d@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/19/25 13:27, Tomas Vondra wrote:
> On 3/19/25 08:17, Heikki Linnakangas wrote:
>> On 19/03/2025 04:22, Tomas Vondra wrote:
>>> I kept stress-testing this, and while the frequency massively increased
>>> on PG18, I managed to reproduce this all the way back to PG14. I see
>>> ~100x more corefiles on PG18.
>>>
>>> That is not a proof the issue was introduced in PG14, maybe it's just
>>> the assert that was added there or something. Or maybe there's another
>>> bug in PG18, making the impact worse.
>>>
>>> But I'd suspect this is a bug in
>>>
>>> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
>>> Author: Andres Freund <andres(at)anarazel(dot)de>
>>> Date:   Mon Aug 17 21:07:10 2020 -0700
>>>
>>>      snapshot scalability: cache snapshots using a xact completion
>>> counter.
>>>
>>>      Previous commits made it faster/more scalable to compute snapshots.
>>> But not
>>>      building a snapshot is still faster. Now that GetSnapshotData()
>>> does not
>>>      maintain RecentGlobal* anymore, that is actually not too hard:
>>>
>>>      ...
>>
>> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
>> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
>> This can happen during hot standby:
>>
>> 1. Backend acquires snapshot A with xmin 1000
>> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
>> 3. Backend acquires snapshot B with xmin 1050
>> 4. Backend releases snapshot A, updating TransactionXmin to 1050
>> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
>> snapshot A's data.
>>
>> Because xactCompletionCount is not updated in step 2, the
>> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
>> has a lower xmin.
>>
>
> Could be. As an experiment I added xactCompletionCount advance to the
> two functions you mentioned, and I ran the stress test again. I haven't
> seen any failures so far, after ~1000 runs. Without the patch this
> produced ~200 failures/core files.
>

I kept stress-testing this (without the fix), trying to figure out why
the frequency of failures got so much higher on PG18. While discussing
this off-list with Andres, he was wondering if maybe there's a second
independent bug on PG18, with the same symptoms.

It took quite a bit of time, but I managed to narrow this down to ~mid
December 2024, commit 952365cded6. The following table shows the number
of assert failures on a single run of the stress test, on 4 different
machines.

xeon ryzen rpi5-32 rpi5-64
-------------------------------------------------------
1585ff7387d 247 52 69 28
952365cded6 199 30 - -
7ec4b9ff80d 1 7 - -
7f97b4734f9 3 17 - -
578a7fe7b6f - - - -
db448ce5ad3 - - - -
1f81b48a9d5 - - - -
d5a7bd5670c 0 11 1 0

There's a fair amount of randomness - partially due to the bug being a
race condition, and thus time-sensitive. And then also the test suite is
randomized.

But I think it pretty clearly shows a clear change in these commits. The
rpi5 machines take much longer, so I only have the two results for now.

But it seems it changed in 952365cded6, which is:

commit 952365cded635e54c4177399c0280cb7a5e34c11
Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Date: Mon Dec 23 12:42:39 2024 +0200

Remove unnecessary GetTransactionSnapshot() calls

In get_database_list() and get_subscription_list(), the
GetTransactionSnapshot() call is not required because the catalog
table scans use the catalog snapshot, which is held until the end of
the scan. See table_beginscan_catalog(), which calls
RegisterSnapshot(GetCatalogSnapshot(relid)).

...

I'm not claiming the commit is buggy - it might be, but I think it's
more likely it just made the preexisting bug easier to hit. I base this
on the observation that incrementing the xactCompletionCount made the
assert failures go away.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ni Ku 2025-03-21 10:30:47 Re: Changing shared_buffers without restart
Previous Message Álvaro Herrera 2025-03-21 10:18:00 Re: support virtual generated column not null constraint