Re: race condition in pg_class

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Smolkin Grigory <smallkeen(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Lakhin <exclusion(at)gmail(dot)com>
Subject: Re: race condition in pg_class
Date: 2024-08-16 09:26:28
Message-ID: c2bc625f-2a79-4be0-a154-72b07c347703@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 14/07/2024 20:48, Noah Misch wrote:
> I've pushed the two patches for your reports. To placate cfbot, I'm attaching
> the remaining patches.

inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would
be in order, it looks pretty random as it is. Something like:

/*
* Tuple locks are currently only held for short durations within a
* transaction. Check that we didn't forget to release one.
*/

inplace110-successors-v8.patch: Makes sense.

The README changes would be better as part of the third patch, as this
patch doesn't actually do any of the new locking described in the
README, and it fixes the "inplace update updates wrong tuple" bug even
without those tuple locks.

> + * ... [any slow preparation not requiring oldtup] ...
> + * heap_inplace_update_scan([...], &tup, &inplace_state);
> + * if (!HeapTupleIsValid(tup))
> + * elog(ERROR, [...]);
> + * ... [buffer is exclusive-locked; mutate "tup"] ...
> + * if (dirty)
> + * heap_inplace_update_finish(inplace_state, tup);
> + * else
> + * heap_inplace_update_cancel(inplace_state);

I wonder if the functions should be called "systable_*" and placed in
genam.c rather than in heapam.c. The interface looks more like the
existing systable functions. It feels like a modularity violation for a
function in heapam.c to take an argument like "indexId", and call back
into systable_* functions.

> /*----------
> * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
> *
> * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
> * ["R" is a VACUUM tbl]
> * D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
> * c * D: systable_getnext() returns pg_class tuple of tbl
> * R: memcpy() into pg_class tuple of tbl
> * D: raise pg_database.datfrozenxid, XLogInsert(), finish
> * [crash]
> * [recovery restores datfrozenxid w/o relfrozenxid]
> */

Hmm, that's a tight race, but feels bad to leave it unfixed. One
approach would be to modify the tuple on the buffer only after
WAL-logging it. That way, D cannot read the updated value before it has
been WAL logged. Just need to make sure that the change still gets
included in the WAL record. Maybe something like:

if (RelationNeedsWAL(relation))
{
/*
* Make a temporary copy of the page that includes the change, in
* case the a full-page image is logged
*/
PGAlignedBlock tmppage;

memcpy(tmppage.data, page, BLCKSZ);

/* copy the tuple to the temporary copy */
memcpy(...);

XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
XLogInsert();
}

/* copy the tuple to the buffer */
memcpy(...);

> pg_class heap_inplace_update_scan() callers: before the call, acquire
> LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
> (VACUUM), or a mode with strictly more conflicts. If the update targets a
> row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
> on the table. Locking the index rel is optional. (This allows VACUUM to
> overwrite per-index pg_class while holding a lock on the table alone.) We
> could allow weaker locks, in which case the next paragraph would simply call
> for stronger locks for its class of commands. heap_inplace_update_scan()
> acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for
> ExclusiveLock, on each tuple it overwrites.
>
> pg_class heap_update() callers: before copying the tuple to modify, take a
> lock that conflicts with at least one of those from the preceding paragraph.
> SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.
> After heap_update(), release any LOCKTAG_TUPLE. Most of these callers opt
> to acquire just the LOCKTAG_RELATION.

These rules seem complicated. Phrasing this slightly differently, if I
understand correctly: for a heap_update() caller, it's always sufficient
to hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the
relation that conflicts with those mentioned in the first paragraph,
then you can skip the LOCKTAG_TUPLE lock.

Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
call heap_update() on pg_class or pg_database? That'd make the rule simple.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-08-16 10:11:18 Re: [PATCH] Add get_bytes() and set_bytes() functions
Previous Message shveta malik 2024-08-16 09:24:41 Re: Conflict detection and logging in logical replication