From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: table_tuple_lock's snapshot argument |
Date: | 2025-01-21 10:03:58 |
Message-ID: | CAPpHfds1F+yc+pH3aN02uyxtG0Hk4M8eRSznXyMqp2tbR16uqw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi!
On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> > /*
> > * Lock a tuple in the specified mode.
> > *
> > * Input parameters:
> > * relation: relation containing tuple (caller must hold suitable lock)
> > * tid: TID of tuple to lock
> > * snapshot: snapshot to use for visibility determinations
> > * cid: current command ID (used for visibility test, and stored into
> > * tuple's cmax if lock is successful)
> > * mode: lock mode desired
> > * wait_policy: what to do if tuple lock is not available
> > * flags:
> > * If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the update chain to
> > * also lock descendant tuples if lock modes don't conflict.
> > * If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain and lock
> > * latest version.
> > *
> > * Output parameters:
> > * *slot: contains the target tuple
> > * *tmfd: filled in failure cases (see below)
> > *
> > * Function result may be:
> > * TM_Ok: lock was successfully acquired
> > * TM_Invisible: lock failed because tuple was never visible to us
> > * TM_SelfModified: lock failed because tuple updated by self
> > * TM_Updated: lock failed because tuple updated by other xact
> > * TM_Deleted: lock failed because tuple deleted by other xact
> > * TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
> > *
> > * In the failure cases other than TM_Invisible and TM_Deleted, the routine
> > * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax. See
> > * comments for struct TM_FailureData for additional info.
> > */
> > static inline TM_Result
> > table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
> > TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
> > LockWaitPolicy wait_policy, uint8 flags,
> > TM_FailureData *tmfd)
>
> What are the semantics of the 'snapshot' argument? In the heapam
> implementation, it's not used for anything. What visibility checks the
> function might do in a different implementation? I vaguely remember that
> the idea was that the TID might not be sufficient to uniquely identify
> the row version in something like zheap, which updates the row in place.
> In that case, all the different row versions are represented by the same
> TID, and the snapshot identifies the version.
>
> There are a few callers of table_tuple_lock:
>
> 1. trigger.c: GetTupleForTrigger
> 2. nodeModifyTable.c
> 3. nodeLockRows.c
> 4. execReplication.c
>
> The first three callers pass the EState's snapshot, the same that was
> used in a table or index scan that returned the TID. That makes sense.
> But the calls in execReplication.c look fishy:
>
> > PushActiveSnapshot(GetLatestSnapshot());
> >
> > res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
> > outslot,
> > GetCurrentCommandId(false),
> > lockmode,
> > LockWaitBlock,
> > 0 /* don't follow updates */ ,
> > &tmfd);
> >
> > PopActiveSnapshot();
> >
> > if (should_refetch_tuple(res, &tmfd))
> > goto retry;
>
> Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong.
> I think the idea was to push the latest snapshot and use the same
> snapshot in the call to table_tuple_lock(). But because each call to
> GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as
> the active snapshot and passes a *different* snapshot to
> table_tuple_lock(). This went wrong in commit 5db6df0c01, which
> introduced the update/delete/insert/lock table AM interface. The
> argument to table_tuple_lock() was supposed to be GetActiveSnapshot().
>
> However, I think GetLatestSnapshot() is wrong here anyway. None of this
> matters for heapam which just ignores the 'snapshot' argument, but let's
> assume a different AM that would use the snapshot to identify the tuple
> version. The TID was fetched from an index using an index scan with
> SnapshotDirty. There's no guarantee that the LatestSnapshot would match
> the same tuple version that the index scan found. If an MVCC snapshot is
> needed, surely it should be acquired before the index scan, and used for
> the index scan as well.
>
> I see three options:
>
> 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
> get a table AM where a single TID represents multiple row versions, this
> will need to be revisited.
>
> 2. Rewrite the recheck code in execReplication.c so that it uses the
> snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
> and use the same snapshot in the index scan and table_tuple_lock().
> Acquiring a snapshot isn't free though, so it would be nice to avoid
> doing that when the heapam is just going to ignore it anyway. If we go
> with this option, I think we could reuse the snapshot that is already
> active in most cases, and only take a new snapshot if the tuple was
> concurrently updated.
>
> 3. Modify the tableam interface so that the index scan can return a more
> unique identifier of the tuple version. In heapam, it could be the TID
> like today, but a different AM could return some other token. Continue
> to use SnapshotDirty in the index scan, but in the call to
> table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
> the token you got index_getnext_slot().
Thank you for your finding and analysis. I would vote for #3.
Currently snapshot argument is not properly used, and the heapam code
doesn't check it. I think allowing flexible tuple identifier is
essential for future of table AM API. Therefore, any
snapshot/visibility information could be put inside that tuple
identifier if needed.
------
Regards,
Alexander Korotkov
Supabase
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Korotkov | 2025-01-21 10:05:34 | Re: table_tuple_lock's snapshot argument |
Previous Message | Álvaro Herrera | 2025-01-21 09:44:51 | Re: Bug in detaching a partition with a foreign key. |