Re: race condition in pg_class

From: Noah Misch <noah(at)leadboat(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Smolkin Grigory <smallkeen(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Lakhin <exclusion(at)gmail(dot)com>
Subject: Re: race condition in pg_class
Date: 2024-08-22 07:32:00
Message-ID: 20240822073200.4f.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> On 17/08/2024 07:07, Noah Misch wrote:
> > On Fri, Aug 16, 2024 at 12:26:28PM +0300, Heikki Linnakangas wrote:
> > > I wonder if the functions should be called "systable_*" and placed in
> > > genam.c rather than in heapam.c. The interface looks more like the existing
> > > systable functions. It feels like a modularity violation for a function in
> > > heapam.c to take an argument like "indexId", and call back into systable_*
> > > functions.
> >
> > Yes, _scan() and _cancel() especially are wrappers around systable. Some API
> > options follow. Any preference or other ideas?
> >
> > ==== direct s/heap_/systable_/ rename [option 1]
> >
> > systable_inplace_update_scan([...], &tup, &inplace_state);
> > if (!HeapTupleIsValid(tup))
> > elog(ERROR, [...]);
> > ... [buffer is exclusive-locked; mutate "tup"] ...
> > if (dirty)
> > systable_inplace_update_finish(inplace_state, tup);
> > else
> > systable_inplace_update_cancel(inplace_state);
> >
> > ==== make the first and last steps more systable-like [option 2]
> >
> > systable_inplace_update_begin([...], &tup, &inplace_state);
> > if (!HeapTupleIsValid(tup))
> > elog(ERROR, [...]);
> > ... [buffer is exclusive-locked; mutate "tup"] ...
> > if (dirty)
> > systable_inplace_update(inplace_state, tup);
> > systable_inplace_update_end(inplace_state);

> My order of preference is: 2, 1, 3.

I kept tuple locking responsibility in heapam.c. That's simpler and better
for modularity, but it does mean we release+acquire after any xmax wait.
Before, we avoided that if the next genam.c scan found the same TID. (If the
next scan finds the same TID, the xmax probably aborted.) I think DDL aborts
are rare enough to justify simplifying as this version does. I don't expect
anyone to notice the starvation outside of tests built to show it. (With
previous versions, one can show it with a purpose-built test that commits
instead of aborting, like the "001_pgbench_grant(at)9" test.)

This move also loses the optimization of unpinning before XactLockTableWait().
heap_update() doesn't optimize that way, so that's fine.

The move ended up more like (1), though I did do
s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2). I
felt that worked better than (2) to achieve lock release before
CacheInvalidateHeapTuple(). Alternatives that could be fine:

- In the cancel case, call both systable_inplace_update_cancel and
systable_inplace_update_end. _finish or _cancel would own unlock, while
_end would own systable_endscan().

- Hoist the CacheInvalidateHeapTuple() up to the genam.c layer. While
tolerable now, this gets less attractive after the inplace160 patch from
https://postgr.es/m/flat/20240523000548(dot)58(dot)nmisch(at)google(dot)com

I made the other changes we discussed, also.

> > > Could we just stipulate that you must always hold LOCKTAG_TUPLE when you
> > > call heap_update() on pg_class or pg_database? That'd make the rule simple.
> >
> > We could. That would change more code sites. Rough estimate:
> >
> > $ git grep -E CatalogTupleUpd'.*(class|relrelation|relationRelation)' | wc -l
> > 23

> How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> always require a tuple lock on indexes, if that would make a difference.

Three sites. See attached inplace125 patch. Is it a net improvement? If so,
I'll squash it into inplace120.

> > Another option here would be to preface that README section with a simplified
> > view, something like, "If a warning brought you here, take a tuple lock. The
> > rest of this section is just for people needing to understand the conditions
> > for --enable-casserts emitting that warning." How about that instead of
> > simplifying the rules?
>
> Works for me. Or perhaps the rules could just be explained more succinctly.
> Something like:
>
> -----

I largely used your text instead.

While doing these updates, I found an intra-grant-inplace.spec permutation
being flaky on inplace110 but stable on inplace120. That turned out not to be
v9-specific. As of patch v1, I now see it was already flaky (~5% failure
here). I've now added to inplace110 a minimal tweak to stabilize that spec,
which inplace120 removes.

Thanks,
nm

Attachment Content-Type Size
inplace090-LOCKTAG_TUPLE-eoxact-v9.patch text/plain 1.3 KB
inplace110-successors-v9.patch text/plain 44.4 KB
inplace120-locktag-v9.patch text/plain 47.0 KB
inplace125-no-exception-for-indexes-v9.patch text/plain 10.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2024-08-22 07:33:54 Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bind
Previous Message Jelte Fennema-Nio 2024-08-22 07:31:54 Re: RFC: Additional Directory for Extensions