Re: Snapshot related assert failure on skink

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Snapshot related assert failure on skink
Date: 2025-03-24 21:25:41
Message-ID: 33c9a3ed-b850-4b21-ac6c-9d9b3d7fe155@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/24/25 16:25, Heikki Linnakangas wrote:
> On 24/03/2025 16:56, Tomas Vondra wrote:
>>
>>
>> On 3/23/25 17:43, Heikki Linnakangas wrote:
>>> On 21/03/2025 17:16, Andres Freund wrote:
>>>> Am I right in understanding that the only scenario (when in
>>>> STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds()
>>>> would
>>>> "legally" remove a transaction, rather than the commit / abort records
>>>> doing
>>>> so, is if the primary crash-restarted while transactions were ongoing?
>>>>
>>>> Those transactions won't have a commit/abort records and thus won't
>>>> trigger
>>>> ExpireTreeKnownAssignedTransactionIds(), which otherwise would have
>>>> updated
>>>> ->xactCompletionCount?
>>>
>>> Correct.
>>>
>>>> When writing the snapshot caching patch, I tried to make sure that all
>>>> the
>>>> places that maintain ->latestCompletedXid also update
>>>> ->xactCompletionCount. Afaict that's still the case. Which, I think,
>>>> means
>>>> that we're also missing calls to MaintainLatestCompletedXidRecovery()?
>>>
>>> Yep, I was just looking at that too.
>>>
>>>> If latestCompletedXid is incorrect visibility determinations end up
>>>> wrong...
>>>
>>> I think it happens to work, because the transactions are effectively
>>> aborted. latestCompletedXid is used to initialize xmax in
>>> GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
>>> to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
>>> still correctly considers XID 1001 as "not visible". As soon as a
>>> transaction commits, latestCompletedXid is fixed.
>>>
>>> AFAICS we could skip updating latestCompletedXid on aborts altogether
>>> and rename it to latestCommittedXid. But it's hardly worth optimizing
>>> for aborts.
>>>
>>> For the same reason, I believe the assertion failure we're discussing
>>> here is also harmless. Even though the reused snapshot has a smaller
>>> xmin than expected, all those transactions aborted and are thus not
>>> visible anyway.
>>>
>>> In any case, let's be tidy and fix both latestCompletedXid and
>>> xactCompletionCount.
>>>
>>
>> Thanks for looking into this and pushing the fix.
>>
>> Would it make sense to add a comment documenting this reasoning about
>> not handling aborts? Otherwise someone will get to rediscover this in
>> the future ...
>
> Well, we're currently _not_ relying on the reasoning about aborts not
> changing visibility. So it's not like someone will forget the reasoning
> and have a bug as a result. I guess someone could rediscover that
> reasoning and think that they can skip advancing those counters on
> aborts as an optimization, re-introducing the assertion failure. But I
> believe that was not why we had this bug, it was just a simple omission.
>
> So the comment would look like this:
>
> "In principle, we would not need to advance xactCompletionCount and
> latestCompletedXid because aborting transactions don't change
> visibility. But that could make xmin within a transaction go backwards,
> if a snapshot with an older xmin but same xactCompletionCount is reused,
> triggering the assertion in GetSnapshotDataReuse()."
>
> If we add that, I guess it should go into ProcArrayEndTransaction()
> which currently just notes that it's used for commits and aborts
> interchangeably. Do you think it's worth it? Want to propose a patch?
>

I think the wording you suggested above is reasonable - I certainly
don't have a better one. I don't know what's the likelihood of someone
breaking this, but I think it makes sense to mention this because the
assert is non-trivial to hit.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2025-03-24 22:05:16 Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Previous Message Andrei Lepikhov 2025-03-24 21:23:18 Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment