From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Snapshot related assert failure on skink |
Date: | 2025-03-21 10:28:23 |
Message-ID: | 46e1ad87-9547-444c-838d-808834cf185d@vondra.me |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 3/19/25 13:27, Tomas Vondra wrote:
> On 3/19/25 08:17, Heikki Linnakangas wrote:
>> On 19/03/2025 04:22, Tomas Vondra wrote:
>>> I kept stress-testing this, and while the frequency massively increased
>>> on PG18, I managed to reproduce this all the way back to PG14. I see
>>> ~100x more corefiles on PG18.
>>>
>>> That is not a proof the issue was introduced in PG14, maybe it's just
>>> the assert that was added there or something. Or maybe there's another
>>> bug in PG18, making the impact worse.
>>>
>>> But I'd suspect this is a bug in
>>>
>>> commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
>>> Author: Andres Freund <andres(at)anarazel(dot)de>
>>> Date: Mon Aug 17 21:07:10 2020 -0700
>>>
>>> snapshot scalability: cache snapshots using a xact completion
>>> counter.
>>>
>>> Previous commits made it faster/more scalable to compute snapshots.
>>> But not
>>> building a snapshot is still faster. Now that GetSnapshotData()
>>> does not
>>> maintain RecentGlobal* anymore, that is actually not too hard:
>>>
>>> ...
>>
>> Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
>> and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
>> This can happen during hot standby:
>>
>> 1. Backend acquires snapshot A with xmin 1000
>> 2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
>> 3. Backend acquires snapshot B with xmin 1050
>> 4. Backend releases snapshot A, updating TransactionXmin to 1050
>> 5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
>> snapshot A's data.
>>
>> Because xactCompletionCount is not updated in step 2, the
>> GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
>> has a lower xmin.
>>
>
> Could be. As an experiment I added xactCompletionCount advance to the
> two functions you mentioned, and I ran the stress test again. I haven't
> seen any failures so far, after ~1000 runs. Without the patch this
> produced ~200 failures/core files.
>
I kept stress-testing this (without the fix), trying to figure out why
the frequency of failures got so much higher on PG18. While discussing
this off-list with Andres, he was wondering if maybe there's a second
independent bug on PG18, with the same symptoms.
It took quite a bit of time, but I managed to narrow this down to ~mid
December 2024, commit 952365cded6. The following table shows the number
of assert failures on a single run of the stress test, on 4 different
machines.
xeon ryzen rpi5-32 rpi5-64
-------------------------------------------------------
1585ff7387d 247 52 69 28
952365cded6 199 30 - -
7ec4b9ff80d 1 7 - -
7f97b4734f9 3 17 - -
578a7fe7b6f - - - -
db448ce5ad3 - - - -
1f81b48a9d5 - - - -
d5a7bd5670c 0 11 1 0
There's a fair amount of randomness - partially due to the bug being a
race condition, and thus time-sensitive. And then also the test suite is
randomized.
But I think it pretty clearly shows a clear change in these commits. The
rpi5 machines take much longer, so I only have the two results for now.
But it seems it changed in 952365cded6, which is:
commit 952365cded635e54c4177399c0280cb7a5e34c11
Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Date: Mon Dec 23 12:42:39 2024 +0200
Remove unnecessary GetTransactionSnapshot() calls
In get_database_list() and get_subscription_list(), the
GetTransactionSnapshot() call is not required because the catalog
table scans use the catalog snapshot, which is held until the end of
the scan. See table_beginscan_catalog(), which calls
RegisterSnapshot(GetCatalogSnapshot(relid)).
...
I'm not claiming the commit is buggy - it might be, but I think it's
more likely it just made the preexisting bug easier to hit. I base this
on the observation that incrementing the xactCompletionCount made the
assert failures go away.
regards
--
Tomas Vondra
From | Date | Subject | |
---|---|---|---|
Next Message | Ni Ku | 2025-03-21 10:30:47 | Re: Changing shared_buffers without restart |
Previous Message | Álvaro Herrera | 2025-03-21 10:18:00 | Re: support virtual generated column not null constraint |