Re: race condition in pg_class

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Smolkin Grigory <smallkeen(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Lakhin <exclusion(at)gmail(dot)com>
Subject: Re: race condition in pg_class
Date: 2024-08-17 04:07:48
Message-ID: 20240817040748.f8.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for reviewing.

On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> On 14/07/2024 20:48, Noah Misch wrote:
> > I've pushed the two patches for your reports. To placate cfbot, I'm attaching
> > the remaining patches.
>
> inplace090-LOCKTAG_TUPLE-eoxact-v8.patch: Makes sense. A comment would be in
> order, it looks pretty random as it is. Something like:
>
> /*
> * Tuple locks are currently only held for short durations within a
> * transaction. Check that we didn't forget to release one.
> */

Will add.

> inplace110-successors-v8.patch: Makes sense.
>
> The README changes would be better as part of the third patch, as this patch
> doesn't actually do any of the new locking described in the README, and it
> fixes the "inplace update updates wrong tuple" bug even without those tuple
> locks.

That should work. Will confirm.

> > + * ... [any slow preparation not requiring oldtup] ...
> > + * heap_inplace_update_scan([...], &tup, &inplace_state);
> > + * if (!HeapTupleIsValid(tup))
> > + * elog(ERROR, [...]);
> > + * ... [buffer is exclusive-locked; mutate "tup"] ...
> > + * if (dirty)
> > + * heap_inplace_update_finish(inplace_state, tup);
> > + * else
> > + * heap_inplace_update_cancel(inplace_state);
>
> I wonder if the functions should be called "systable_*" and placed in
> genam.c rather than in heapam.c. The interface looks more like the existing
> systable functions. It feels like a modularity violation for a function in
> heapam.c to take an argument like "indexId", and call back into systable_*
> functions.

Yes, _scan() and _cancel() especially are wrappers around systable. Some API
options follow. Any preference or other ideas?

==== direct s/heap_/systable_/ rename

systable_inplace_update_scan([...], &tup, &inplace_state);
if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
... [buffer is exclusive-locked; mutate "tup"] ...
if (dirty)
systable_inplace_update_finish(inplace_state, tup);
else
systable_inplace_update_cancel(inplace_state);

==== make the first and last steps more systable-like

systable_inplace_update_begin([...], &tup, &inplace_state);
if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
... [buffer is exclusive-locked; mutate "tup"] ...
if (dirty)
systable_inplace_update(inplace_state, tup);
systable_inplace_update_end(inplace_state);

==== no systable_ wrapper for middle step, more like CatalogTupleUpdate

systable_inplace_update_begin([...], &tup, &inplace_state);
if (!HeapTupleIsValid(tup))
elog(ERROR, [...]);
... [buffer is exclusive-locked; mutate "tup"] ...
if (dirty)
heap_inplace_update(relation,
systable_inplace_old_tuple(inplace_state),
tup,
systable_inplace_buffer(inplace_state));
systable_inplace_update_end(inplace_state);

> > /*----------
> > * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
> > *
> > * ["D" is a VACUUM (ONLY_DATABASE_STATS)]
> > * ["R" is a VACUUM tbl]
> > * D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
> > * c * D: systable_getnext() returns pg_class tuple of tbl
> > * R: memcpy() into pg_class tuple of tbl
> > * D: raise pg_database.datfrozenxid, XLogInsert(), finish
> > * [crash]
> > * [recovery restores datfrozenxid w/o relfrozenxid]
> > */
>
> Hmm, that's a tight race, but feels bad to leave it unfixed. One approach
> would be to modify the tuple on the buffer only after WAL-logging it. That
> way, D cannot read the updated value before it has been WAL logged. Just
> need to make sure that the change still gets included in the WAL record.
> Maybe something like:
>
> if (RelationNeedsWAL(relation))
> {
> /*
> * Make a temporary copy of the page that includes the change, in
> * case the a full-page image is logged
> */
> PGAlignedBlock tmppage;
>
> memcpy(tmppage.data, page, BLCKSZ);
>
> /* copy the tuple to the temporary copy */
> memcpy(...);
>
> XLogRegisterBlock(0, ..., tmppage, REGBUF_STANDARD);
> XLogInsert();
> }
>
> /* copy the tuple to the buffer */
> memcpy(...);

Yes, that's the essence of
inplace180-datfrozenxid-overtakes-relfrozenxid-v1.patch from
https://postgr.es/m/flat/20240620012908(dot)92(dot)nmisch(at)google(dot)com(dot)

> > pg_class heap_inplace_update_scan() callers: before the call, acquire
> > LOCKTAG_RELATION in mode ShareLock (CREATE INDEX), ShareUpdateExclusiveLock
> > (VACUUM), or a mode with strictly more conflicts. If the update targets a
> > row of RELKIND_INDEX (but not RELKIND_PARTITIONED_INDEX), that lock must be
> > on the table. Locking the index rel is optional. (This allows VACUUM to
> > overwrite per-index pg_class while holding a lock on the table alone.) We
> > could allow weaker locks, in which case the next paragraph would simply call
> > for stronger locks for its class of commands. heap_inplace_update_scan()
> > acquires and releases LOCKTAG_TUPLE in InplaceUpdateTupleLock, an alias for
> > ExclusiveLock, on each tuple it overwrites.
> >
> > pg_class heap_update() callers: before copying the tuple to modify, take a
> > lock that conflicts with at least one of those from the preceding paragraph.
> > SearchSysCacheLocked1() is one convenient way to acquire LOCKTAG_TUPLE.
> > After heap_update(), release any LOCKTAG_TUPLE. Most of these callers opt
> > to acquire just the LOCKTAG_RELATION.
>
> These rules seem complicated. Phrasing this slightly differently, if I
> understand correctly: for a heap_update() caller, it's always sufficient to
> hold LOCKTAG_TUPLE, but if you happen to hold some other lock on the
> relation that conflicts with those mentioned in the first paragraph, then
> you can skip the LOCKTAG_TUPLE lock.

Yes.

> Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
> call heap_update() on pg_class or pg_database? That'd make the rule simple.

We could. That would change more code sites. Rough estimate:

$ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
23

If the count were 2, I'd say let's simplify the rule like you're exploring.
(I originally had a complicated rule for pg_database, but I abandoned that
when it helped few code sites.) If it were 100, I'd say the complicated rule
is worth it. A count of 23 makes both choices fair.

Long-term, I hope relfrozenxid gets reimplemented with storage outside
pg_class, removing the need for inplace updates. So the additional 23 code
sites might change back at a future date. That shouldn't be a big
consideration, though.

Another option here would be to preface that README section with a simplified
view, something like, "If a warning brought you here, take a tuple lock. The
rest of this section is just for people needing to understand the conditions
for --enable-casserts emitting that warning." How about that instead of
simplifying the rules?

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2024-08-17 05:14:10 Re: Speed up Hash Join by teaching ExprState about hashing
Previous Message Thomas Munro 2024-08-17 01:22:30 Re: macOS prefetching support