From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Snapshot related assert failure on skink |
Date: | 2025-03-24 15:25:54 |
Message-ID: | 76bcbe27-4408-4f01-8623-4a6eff782312@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 24/03/2025 16:56, Tomas Vondra wrote:
>
>
> On 3/23/25 17:43, Heikki Linnakangas wrote:
>> On 21/03/2025 17:16, Andres Freund wrote:
>>> Am I right in understanding that the only scenario (when in
>>> STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds()
>>> would
>>> "legally" remove a transaction, rather than the commit / abort records
>>> doing
>>> so, is if the primary crash-restarted while transactions were ongoing?
>>>
>>> Those transactions won't have a commit/abort records and thus won't
>>> trigger
>>> ExpireTreeKnownAssignedTransactionIds(), which otherwise would have
>>> updated
>>> ->xactCompletionCount?
>>
>> Correct.
>>
>>> When writing the snapshot caching patch, I tried to make sure that all
>>> the
>>> places that maintain ->latestCompletedXid also update
>>> ->xactCompletionCount. Afaict that's still the case. Which, I think,
>>> means
>>> that we're also missing calls to MaintainLatestCompletedXidRecovery()?
>>
>> Yep, I was just looking at that too.
>>
>>> If latestCompletedXid is incorrect visibility determinations end up
>>> wrong...
>>
>> I think it happens to work, because the transactions are effectively
>> aborted. latestCompletedXid is used to initialize xmax in
>> GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
>> to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
>> still correctly considers XID 1001 as "not visible". As soon as a
>> transaction commits, latestCompletedXid is fixed.
>>
>> AFAICS we could skip updating latestCompletedXid on aborts altogether
>> and rename it to latestCommittedXid. But it's hardly worth optimizing
>> for aborts.
>>
>> For the same reason, I believe the assertion failure we're discussing
>> here is also harmless. Even though the reused snapshot has a smaller
>> xmin than expected, all those transactions aborted and are thus not
>> visible anyway.
>>
>> In any case, let's be tidy and fix both latestCompletedXid and
>> xactCompletionCount.
>>
>
> Thanks for looking into this and pushing the fix.
>
> Would it make sense to add a comment documenting this reasoning about
> not handling aborts? Otherwise someone will get to rediscover this in
> the future ...
Well, we're currently _not_ relying on the reasoning about aborts not
changing visibility. So it's not like someone will forget the reasoning
and have a bug as a result. I guess someone could rediscover that
reasoning and think that they can skip advancing those counters on
aborts as an optimization, re-introducing the assertion failure. But I
believe that was not why we had this bug, it was just a simple omission.
So the comment would look like this:
"In principle, we would not need to advance xactCompletionCount and
latestCompletedXid because aborting transactions don't change
visibility. But that could make xmin within a transaction go backwards,
if a snapshot with an older xmin but same xactCompletionCount is reused,
triggering the assertion in GetSnapshotDataReuse()."
If we add that, I guess it should go into ProcArrayEndTransaction()
which currently just notes that it's used for commits and aborts
interchangeably. Do you think it's worth it? Want to propose a patch?
--
Heikki Linnakangas
Neon (https://neon.tech)
From | Date | Subject | |
---|---|---|---|
Next Message | David G. Johnston | 2025-03-24 15:30:47 | Re: vacuum_truncate configuration parameter and isset_offset |
Previous Message | Robert Haas | 2025-03-24 15:25:44 | Re: vacuum_truncate configuration parameter and isset_offset |