Re: race condition in pg_class

From: Nitin Motiani <nitinmotiani(at)google(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Smolkin Grigory <smallkeen(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Alexander Lakhin <exclusion(at)gmail(dot)com>
Subject: Re: race condition in pg_class
Date: 2024-09-03 15:54:52
Message-ID: CAH5HC96yPfrLVQ-srZ65R73t_dBLNM4mCR-RGcT5zNa04DHSxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Aug 31, 2024 at 6:40 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> On Thu, Aug 29, 2024 at 09:08:43PM +0530, Nitin Motiani wrote:
> > On Thu, Aug 29, 2024 at 8:11 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > > > My order of preference is: 2, 1, 3.
> > >
> > > I kept tuple locking responsibility in heapam.c. That's simpler and better
> > > for modularity, but it does mean we release+acquire after any xmax wait.
> > > Before, we avoided that if the next genam.c scan found the same TID. (If the
> > > next scan finds the same TID, the xmax probably aborted.) I think DDL aborts
> > > are rare enough to justify simplifying as this version does. I don't expect
> > > anyone to notice the starvation outside of tests built to show it. (With
> > > previous versions, one can show it with a purpose-built test that commits
> > > instead of aborting, like the "001_pgbench_grant(at)9" test.)
> > >
> > > This move also loses the optimization of unpinning before XactLockTableWait().
> > > heap_update() doesn't optimize that way, so that's fine.
> > >
> > > The move ended up more like (1), though I did do
> > > s/systable_inplace_update_scan/systable_inplace_update_begin/ like in (2). I
> > > felt that worked better than (2) to achieve lock release before
> > > CacheInvalidateHeapTuple(). Alternatives that could be fine:
> > >
> > From a consistency point of view, I find it cleaner if we can have all
> > the heap_inplace_lock and heap_inplace_unlock in the same set of
> > functions. So here those would be the systable_inplace_* functions.
>
> That will technically be the case after inplace160, and I could make it so
> here by inlining heap_inplace_unlock() into its heapam.c caller. Would that
> be cleaner or less clean?
>

I am not sure. It seems more inconsistent to take the lock using
heap_inplace_lock but then just unlock by calling LockBuffer. On the
other hand, it doesn't seem that different from the way
SearchSysCacheLocked1 and UnlockTuple are used in inplace120. If we
are doing it this way, perhaps it would be good to rename
heap_inplace_update to heap_inplace_update_and_unlock.

> > > - In the cancel case, call both systable_inplace_update_cancel and
> > > systable_inplace_update_end. _finish or _cancel would own unlock, while
> > > _end would own systable_endscan().
> > >
> > What happens to CacheInvalidateHeapTuple() in this approach? I think
> > it will still need to be brought to the genam.c layer if we are
> > releasing the lock in systable_inplace_update_finish.
>
> Cancel scenarios don't do invalidation. (Same under other alternatives.)
>

Sorry, I wasn't clear about this one. Let me rephrase. My
understanding is that the code in this approach would look like below
:
if (dirty)
systable_inplace_update_finish(inplace_state, tup);
else
systable_inplace_update_cancel(inplace_state);
systable_inplace_update_end(inplace_state);

And that in this structure, both _finish and _cancel will call
heap_inplace_unlock and then _end will call systable_endscan. So even
with this structure, the invalidation has to happen inside _finish
after the unlock. So this also pulls the invalidation to the genam.c
layer. Am I understanding this correctly?

> > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer. While
> > > tolerable now, this gets less attractive after the inplace160 patch from
> > > https://postgr.es/m/flat/20240523000548(dot)58(dot)nmisch(at)google(dot)com
> > >
> > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > becomes less attractive with the patch. I see there is a new
> > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > while holding the lock. And then there is an
> > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > layer too. Is the concern that one invalidation call will be in the
> > heapam layer and the other will be in the genam layer?
>
> That, or a critical section would start in heapam.c, then end in genam.c.
> Current call tree at inplace160 v4:
>
> genam.c:systable_inplace_update_finish
> heapam.c:heap_inplace_update
> PreInplace_Inval
> START_CRIT_SECTION
> BUFFER_LOCK_UNLOCK
> AtInplace_Inval
> END_CRIT_SECTION
> UnlockTuple
> AcceptInvalidationMessages
>
> If we hoisted all of invalidation up to the genam.c layer, a critical section
> that starts in heapam.c would end in genam.c:
>
> genam.c:systable_inplace_update_finish
> PreInplace_Inval
> heapam.c:heap_inplace_update
> START_CRIT_SECTION
> BUFFER_LOCK_UNLOCK
> AtInplace_Inval
> END_CRIT_SECTION
> UnlockTuple
> AcceptInvalidationMessages
>
> If we didn't accept splitting the critical section but did accept splitting
> invalidation responsibilities, one gets perhaps:
>
> genam.c:systable_inplace_update_finish
> PreInplace_Inval
> heapam.c:heap_inplace_update
> START_CRIT_SECTION
> BUFFER_LOCK_UNLOCK
> AtInplace_Inval
> END_CRIT_SECTION
> UnlockTuple
> AcceptInvalidationMessages
>

How about this alternative?

genam.c:systable_inplace_update_finish
PreInplace_Inval
START_CRIT_SECTION
heapam.c:heap_inplace_update
BUFFER_LOCK_UNLOCK
AtInplace_Inval
END_CRIT_SECTION
UnlockTuple
AcceptInvalidationMessages

Looking at inplace160, it seems that the start of the critical section
is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
and END_CRIT_SECTION out to the genam.c layer? Alternatively since
heap_inplace_update is commented as a subroutine of
systable_inplace_update_finish, should everything just be moved to the
genam.c layer? Although it looks like you already considered and
rejected this approach. So just pulling out the critical sections
start and end is fine. Am I missing something here?

If the above alternatives are not possible, it's probably fine to go
ahead with the current patch with the function renamed to
heap_inplace_update_and_unlock (or something similar) as mentioned
earlier?

> That's how I ended up at inplace120 v9's design.
>
> > Also I have a small question from inplace120.
> >
> > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > we can just call
> > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > big advantage of storing a separate bool field? Also there is another
>
> No, not a big advantage. I felt it was more in line with the typical style of
> src/backend/executor.
>

Thanks for the clarification. For ri_TrigDesc, I see the following
comment in execMain.c :

/* make a copy so as not to depend on relcache info not changing... */
resultRelInfo->ri_TrigDesc = CopyTriggerDesc(resultRelationDesc->trigdesc);

So in this case I see more value in having a separate field compared
to the bool field for ri_needLockTagTuple.

> > write to ri_RelationDesc in CatalogOpenIndexes in
> > src/backlog/catalog/indexing.c. I think ri_needLockTagTuple needs to
> > be set there also to keep it consistent with ri_RelationDesc. Please
> > let me know if I am missing something about the usage of the new
> > field.
>
> Can you say more about consequences you found?
>

My apologies that I wasn't clear. I haven't found any consequences. I
just find it a smell that there are two fields which are not
independent and they go out of sync. And that's why my preference is
to not have a dependent field unless there is a specific advantage.

> Only the full executor reads the field, doing so when it fetches the most
> recent version of a row. CatalogOpenIndexes() callers lack the full
> executor's practice of fetching the most recent version of a row, so they
> couldn't benefit reading the field.
>
> I don't think any CatalogOpenIndexes() caller passes its ResultRelInfo to the
> full executor, and "typedef struct ResultRelInfo *CatalogIndexState" exists in
> part to keep it that way. Since CatalogOpenIndexes() skips ri_TrigDesc and
> other fields, I would expect other malfunctions if some caller tried.
>

Sorry, I missed the typedef. Thanks for pointing that out. I agree
that the likelihood of any malfunction is low. But even for the
ri_TrigDesc, CatalogOpenIndexes() still sets it to NULL. So shouldn't
ri_needLockTagTuple also be set to a default value of false? My
preference would be not to have a separate bool field to avoid
thinking about these scenarios.

Thanks & Regards
Nitin Motiani
Google

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Maciek Sakrejda 2024-09-03 15:58:51 Re: Broken layout: CommitFest Add Review Form
Previous Message Nathan Bossart 2024-09-03 15:50:20 Re: Jargon and acronyms on this mailing list