Re: table_tuple_lock's snapshot argument

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: table_tuple_lock's snapshot argument
Date: 2025-01-21 10:05:34
Message-ID: CAPpHfdvMFn-f8t4a3e2Gcj8=+-uwUenNSk10KotKfXFepRzk3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> >
> > > /*
> > > * Lock a tuple in the specified mode.
> > > *
> > > * Input parameters:
> > > * relation: relation containing tuple (caller must hold suitable lock)
> > > * tid: TID of tuple to lock
> > > * snapshot: snapshot to use for visibility determinations
> > > * cid: current command ID (used for visibility test, and stored into
> > > * tuple's cmax if lock is successful)
> > > * mode: lock mode desired
> > > * wait_policy: what to do if tuple lock is not available
> > > * flags:
> > > * If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the update chain to
> > > * also lock descendant tuples if lock modes don't conflict.
> > > * If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain and lock
> > > * latest version.
> > > *
> > > * Output parameters:
> > > * *slot: contains the target tuple
> > > * *tmfd: filled in failure cases (see below)
> > > *
> > > * Function result may be:
> > > * TM_Ok: lock was successfully acquired
> > > * TM_Invisible: lock failed because tuple was never visible to us
> > > * TM_SelfModified: lock failed because tuple updated by self
> > > * TM_Updated: lock failed because tuple updated by other xact
> > > * TM_Deleted: lock failed because tuple deleted by other xact
> > > * TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
> > > *
> > > * In the failure cases other than TM_Invisible and TM_Deleted, the routine
> > > * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax. See
> > > * comments for struct TM_FailureData for additional info.
> > > */
> > > static inline TM_Result
> > > table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
> > > TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
> > > LockWaitPolicy wait_policy, uint8 flags,
> > > TM_FailureData *tmfd)
> >
> > What are the semantics of the 'snapshot' argument? In the heapam
> > implementation, it's not used for anything. What visibility checks the
> > function might do in a different implementation? I vaguely remember that
> > the idea was that the TID might not be sufficient to uniquely identify
> > the row version in something like zheap, which updates the row in place.
> > In that case, all the different row versions are represented by the same
> > TID, and the snapshot identifies the version.
> >
> > There are a few callers of table_tuple_lock:
> >
> > 1. trigger.c: GetTupleForTrigger
> > 2. nodeModifyTable.c
> > 3. nodeLockRows.c
> > 4. execReplication.c
> >
> > The first three callers pass the EState's snapshot, the same that was
> > used in a table or index scan that returned the TID. That makes sense.
> > But the calls in execReplication.c look fishy:
> >
> > > PushActiveSnapshot(GetLatestSnapshot());
> > >
> > > res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
> > > outslot,
> > > GetCurrentCommandId(false),
> > > lockmode,
> > > LockWaitBlock,
> > > 0 /* don't follow updates */ ,
> > > &tmfd);
> > >
> > > PopActiveSnapshot();
> > >
> > > if (should_refetch_tuple(res, &tmfd))
> > > goto retry;
> >
> > Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong.
> > I think the idea was to push the latest snapshot and use the same
> > snapshot in the call to table_tuple_lock(). But because each call to
> > GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as
> > the active snapshot and passes a *different* snapshot to
> > table_tuple_lock(). This went wrong in commit 5db6df0c01, which
> > introduced the update/delete/insert/lock table AM interface. The
> > argument to table_tuple_lock() was supposed to be GetActiveSnapshot().
> >
> > However, I think GetLatestSnapshot() is wrong here anyway. None of this
> > matters for heapam which just ignores the 'snapshot' argument, but let's
> > assume a different AM that would use the snapshot to identify the tuple
> > version. The TID was fetched from an index using an index scan with
> > SnapshotDirty. There's no guarantee that the LatestSnapshot would match
> > the same tuple version that the index scan found. If an MVCC snapshot is
> > needed, surely it should be acquired before the index scan, and used for
> > the index scan as well.
> >
> > I see three options:
> >
> > 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
> > get a table AM where a single TID represents multiple row versions, this
> > will need to be revisited.
> >
> > 2. Rewrite the recheck code in execReplication.c so that it uses the
> > snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
> > and use the same snapshot in the index scan and table_tuple_lock().
> > Acquiring a snapshot isn't free though, so it would be nice to avoid
> > doing that when the heapam is just going to ignore it anyway. If we go
> > with this option, I think we could reuse the snapshot that is already
> > active in most cases, and only take a new snapshot if the tuple was
> > concurrently updated.
> >
> > 3. Modify the tableam interface so that the index scan can return a more
> > unique identifier of the tuple version. In heapam, it could be the TID
> > like today, but a different AM could return some other token. Continue
> > to use SnapshotDirty in the index scan, but in the call to
> > table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
> > the token you got index_getnext_slot().
>
> Thank you for your finding and analysis. I would vote for #3.
> Currently snapshot argument is not properly used, and the heapam code
> doesn't check it. I think allowing flexible tuple identifier is
> essential for future of table AM API. Therefore, any
> snapshot/visibility information could be put inside that tuple
> identifier if needed.

To be more specific, I vote for #1 + #3. Remove the snapshot argument
for now, then work on flexible tuple identifier.

------
Regards,
Alexander Korotkov
Supabase

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Hunaid Sohail 2025-01-21 10:17:21 Re: Psql meta-command conninfo+
Previous Message Alexander Korotkov 2025-01-21 10:03:58 Re: table_tuple_lock's snapshot argument