Re: race condition in pg_class

From: Noah Misch <noah(at)leadboat(dot)com>
To: Nitin Motiani <nitinmotiani(at)google(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Smolkin Grigory <smallkeen(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Lakhin <exclusion(at)gmail(dot)com>
Subject: Re: race condition in pg_class
Date: 2024-08-31 01:10:43
Message-ID: 20240831011043.2b.nmisch@google.com
Lists: pgsql-hackers

On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > My order of preference is: 2, 1, 3.
> >
> > I kept tuple locking responsibility in heapam.c. That's simpler and better
> > for modularity, but it does mean we release+acquire after any xmax wait.
> > Before, we avoided that if the next genam.c scan found the same TID. (If the
> > next scan finds the same TID, the xmax probably aborted.) I think DDL aborts
> > are rare enough to justify simplifying as this version does. I don't expect
> > anyone to notice the starvation outside of tests built to show it. (With
> > previous versions, one can show it with a purpose-built test that commits
> > instead of aborting, like the "001_pgbench_grant(at)9" test.)
> >
> > This move also loses the optimization of unpinning before XactLockTableWait().
> > heap_update() doesn't optimize that way, so that's fine.
> >
> > The move ended up more like (1), though I did do
> > s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2). I
> > felt that worked better than (2) to achieve lock release before
> > CacheInvalidateHeapTuple(). Alternatives that could be fine:
> >
> From a consistency point of view, I find it cleaner if we can have all
> the heap_inplace_lock and heap_inplace_unlock in the same set of
> functions. So here those would be the systable_inplace_* functions.

That will technically be the case after inplace160, and I could make it so
here by inlining heap_inplace_unlock() into its heapam.c caller. Would that
be cleaner or less clean?
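
For illustration, an untested sketch of what that inlining might look like
(the signature and the elided body are assumed here; the lock calls are the
ones the patches use):

#include "postgres.h"

#include "access/heapam.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"

/* tail of heap_inplace_update() with heap_inplace_unlock() inlined */
void
heap_inplace_update(Relation relation, HeapTuple oldtup, HeapTuple tuple,
                    Buffer buffer)
{
    /* ... overwrite the tuple in place and WAL-log it, as today ... */

    /* was: heap_inplace_unlock(relation, oldtup, buffer) */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    UnlockTuple(relation, &oldtup->t_self, InplaceUpdateTupleLock);
}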

> > - In the cancel case, call both systable_inplace_update_cancel and
> > systable_inplace_update_end. _finish or _cancel would own unlock, while
> > _end would own systable_endscan().
> >
> What happens to CacheInvalidateHeapTuple() in this approach? I think
> it will still need to be brought to the genam.c layer if we are
> releasing the lock in systable_inplace_update_finish.

Cancel scenarios don't do invalidation. (Same under other alternatives.)
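
For what it's worth, an untested sketch of that division of labor in genam.c
(parameter shapes here are hypothetical, not necessarily the patch's):

#include "postgres.h"

#include "access/genam.h"
#include "access/heapam.h"

void
systable_inplace_update_cancel(SysScanDesc scan, HeapTuple oldtup,
                               Buffer buffer)
{
    /* owns the unlock; no CacheInvalidateHeapTuple() in the cancel path */
    heap_inplace_unlock(scan->heap_rel, oldtup, buffer);
}

void
systable_inplace_update_end(SysScanDesc scan)
{
    /* owns ending the scan, in both the finish and cancel cases */
    systable_endscan(scan);
}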

> > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer. While
> > tolerable now, this gets less attractive after the inplace160 patch from
> > https://postgr.es/m/flat/20240523000548(dot)58(dot)nmisch(at)google(dot)com
> >
> I skimmed through the inplace160 patch. It wasn't clear to me why this
> becomes less attractive with the patch. I see there is a new
> CacheInvalidateHeapTupleInPlace but that looks like it would be called
> while holding the lock. And then there is an
> AcceptInvalidationMessages which can perhaps be moved to the genam.c
> layer too. Is the concern that one invalidation call will be in the
> heapam layer and the other will be in the genam layer?

That, or a critical section would start in heapam.c, then end in genam.c.
Current call tree at inplace160 v4:

genam.c:systable_inplace_update_finish
    heapam.c:heap_inplace_update
        PreInplace_Inval
        START_CRIT_SECTION
        BUFFER_LOCK_UNLOCK
        AtInplace_Inval
        END_CRIT_SECTION
        UnlockTuple
        AcceptInvalidationMessages

If we hoisted all of invalidation up to the genam.c layer, a critical section
that starts in heapam.c would end in genam.c:

genam.c:systable_inplace_update_finish
    PreInplace_Inval
    heapam.c:heap_inplace_update
        START_CRIT_SECTION
        BUFFER_LOCK_UNLOCK
    AtInplace_Inval
    END_CRIT_SECTION
    UnlockTuple
    AcceptInvalidationMessages

If we didn't accept splitting the critical section but did accept splitting
invalidation responsibilities, one gets perhaps:

genam.c:systable_inplace_update_finish
    PreInplace_Inval
    heapam.c:heap_inplace_update
        START_CRIT_SECTION
        BUFFER_LOCK_UNLOCK
        AtInplace_Inval
        END_CRIT_SECTION
    UnlockTuple
    AcceptInvalidationMessages

That's how I ended up at inplace120 v9's design.
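
Rendered as heapam.c code, the first of those trees would look roughly like
this (untested sketch; the parameter list is assumed, and
PreInplace_Inval/AtInplace_Inval come from inplace160):

#include "postgres.h"

#include "access/heapam.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "utils/inval.h"

void
heap_inplace_update(Relation relation, HeapTuple oldtup, HeapTuple tuple,
                    Buffer buffer)
{
    PreInplace_Inval();
    START_CRIT_SECTION();
    /* ... overwrite tuple bytes, MarkBufferDirty(), XLogInsert() ... */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    AtInplace_Inval();
    END_CRIT_SECTION();
    UnlockTuple(relation, &oldtup->t_self, InplaceUpdateTupleLock);
    AcceptInvalidationMessages();
}

Invalidation, the critical section, the buffer and tuple unlocks, and
AcceptInvalidationMessages() all stay below heapam.c, so nothing starts in
one layer and ends in the other.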

> Also I have a small question from inplace120.
>
> I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> we can just call
> IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> big advantage of storing a separate bool field? Also there is another

No, not a big advantage. I felt it was more in line with the typical style of
src/backend/executor.
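
For concreteness, here is the equivalence in question as an untested sketch
(the wrapper name and the ItemPointer parameter are placeholders, not from
the patch):

#include "postgres.h"

#include "catalog/catalog.h"
#include "nodes/execnodes.h"
#include "storage/itemptr.h"
#include "storage/lmgr.h"

static void
lock_tag_tuple_if_needed(ResultRelInfo *resultRelInfo, ItemPointer tupleid)
{
    /* the flag is just a cached IsInplaceUpdateRelation() result ... */
    Assert(resultRelInfo->ri_needLockTagTuple ==
           IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc));

    /* ... so this check could equally re-derive it at each call site */
    if (resultRelInfo->ri_needLockTagTuple)
        LockTuple(resultRelInfo->ri_RelationDesc, tupleid,
                  InplaceUpdateTupleLock);
}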

> write to ri_RelationDesc in CatalogOpenIndexes in
> src/backend/catalog/indexing.c. I think ri_needLockTagTuple needs to
> be set there also to keep it consistent with ri_RelationDesc. Please
> let me know if I am missing something about the usage of the new
> field.

Can you say more about consequences you found?

Only the full executor reads the field, doing so when it fetches the most
recent version of a row. CatalogOpenIndexes() callers lack the full
executor's practice of fetching the most recent version of a row, so they
couldn't benefit from reading the field.

I don't think any CatalogOpenIndexes() caller passes its ResultRelInfo to the
full executor, and "typedef struct ResultRelInfo *CatalogIndexState" exists in
part to keep it that way. Since CatalogOpenIndexes() skips ri_TrigDesc and
other fields, I would expect other malfunctions if some caller tried.

Thanks,
nm
