Re: table_tuple_lock's snapshot argument

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: table_tuple_lock's snapshot argument
Date: 2025-03-10 15:22:44
Message-ID: 190b4c9f-8173-4f0d-9877-a7aebb7c41c9@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 21/01/2025 12:05, Alexander Korotkov wrote:
> On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
>> On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> However, I think GetLatestSnapshot() is wrong here anyway. None of this
>>> matters for heapam which just ignores the 'snapshot' argument, but let's
>>> assume a different AM that would use the snapshot to identify the tuple
>>> version. The TID was fetched from an index using an index scan with
>>> SnapshotDirty. There's no guarantee that the LatestSnapshot would match
>>> the same tuple version that the index scan found. If an MVCC snapshot is
>>> needed, surely it should be acquired before the index scan, and used for
>>> the index scan as well.
>>>
>>> I see three options:
>>>
>>> 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
>>> get a table AM where a single TID represents multiple row versions, this
>>> will need to be revisited.
>>>
>>> 2. Rewrite the recheck code in execReplication.c so that it uses the
>>> snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
>>> and use the same snapshot in the index scan and table_tuple_lock().
>>> Acquiring a snapshot isn't free though, so it would be nice to avoid
>>> doing that when the heapam is just going to ignore it anyway. If we go
>>> with this option, I think we could reuse the snapshot that is already
>>> active in most cases, and only take a new snapshot if the tuple was
>>> concurrently updated.
>>>
>>> 3. Modify the tableam interface so that the index scan can return a more
>>> unique identifier of the tuple version. In heapam, it could be the TID
>>> like today, but a different AM could return some other token. Continue
>>> to use SnapshotDirty in the index scan, but in the call to
>>> table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
>>> the token you got index_getnext_slot().
>>
>> Thank you for your finding and analysis. I would vote for #3.
>> Currently snapshot argument is not properly used, and the heapam code
>> doesn't check it. I think allowing flexible tuple identifier is
>> essential for future of table AM API. Therefore, any
>> snapshot/visibility information could be put inside that tuple
>> identifier if needed.
>
> To be more specific, I vote for #1 + #3. Remove the snapshot argument
> for now, then work on flexible tuple identifier.

I committed and backpatched the trivial fix for not, just replacing
GetLatestSnapshot() with GetActiveSnapshot(). I got cold feet on just
removing the argument without providing a replacement, because it's not
100% clear to me if the current situation is broken or not.

To summarize the issue, RelationFindReplTupleByIndex() performs an index
scan with DirtySnapshot, and then calls table_tuple_lock() with a fresh
MVCC snapshot to lock the row. There's no guarantee that the index scan
found the same row version that table_tuple_lock() will lock, if the TID
alone doesn't uniquely identify it. But that's still OK as long as the
key column hasn't changed, like with heapam's HOT updates. I couldn't
convince myself that it's broken nor that it's guaranteed to be correct
with other AMs.

Thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Naga Appani 2025-03-10 15:43:54 [Proposal] Expose internal MultiXact member count function for efficient monitoring
Previous Message Tomas Vondra 2025-03-10 15:16:02 Re: Changing the state of data checksums in a running cluster