Re: race condition in pg_class

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Smolkin Grigory <smallkeen(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: race condition in pg_class
Date: 2023-11-02 03:09:15
Message-ID: 20231102030915.d3.nmisch@google.com
Lists: pgsql-hackers

I prototyped two ways, one with a special t_ctid and one with LockTuple().

On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > On Fri, Oct 27, 2023 at 03:32:26PM -0400, Tom Lane wrote:

> > > I anticipate a new locktag per catalog that can receive inplace updates,
> > > i.e. LOCKTAG_RELATION_DEFINITION and LOCKTAG_DATABASE_DEFINITION.
> >
> > We could perhaps make this work by using the existing tuple-lock
> > infrastructure, rather than inventing new locktags (a choice that
> > spills to a lot of places including clients that examine pg_locks).
>
> That could be okay. It would be weird to reuse a short-term lock like that
> one as something held till end of transaction. But the alternative of new
> locktags ain't perfect, as you say.

That worked.

> > I would prefer though to find a solution that only depends on making
> > heap_inplace_update protect itself, without high-level cooperation
> > from the possibly-interfering updater. This is basically because
> > I'm still afraid that we're defining the problem too narrowly.
> > For one thing, I have nearly zero confidence that GRANT et al are
> > the only problematic source of conflicting transactional updates.
>
> Likewise here, but I have fair confidence that an assertion would flush out
> the rest. heap_inplace_update() would assert that the backend holds one of
> the acceptable locks. It could even be an elog; heap_inplace_update() can
> tolerate that cost.

That check would fall in both heap_inplace_update() and heap_update(). After
all, a heap_inplace_update() check won't detect an omission in GRANT.

> > For another, I'm worried that some extension may be using
> > heap_inplace_update against a catalog we're not considering here.
>
> A pgxn search finds "citus" using heap_inplace_update().
>
> > I'd also like to find a solution that fixes the case of a conflicting
> > manual UPDATE (although certainly that's a stretch goal we may not be
> > able to reach).
>
> It would be nice.

I expect most approaches could get there by having ExecModifyTable() arrange
for the expected locking or other actions. That's analogous to how
heap_update() takes care of sinval even for a manual UPDATE.

> > I wonder if there's a way for heap_inplace_update to mark the tuple
> > header as just-updated in a way that regular heap_update could
> > recognize. (For standard catalog updates, we'd then end up erroring
> > in simple_heap_update, which I think is fine.) We can't update xmin,
> > because the VACUUM callers don't have an XID; but maybe there's some
> > other way? I'm speculating about putting a funny value into xmax,
> > or something like that, and having heap_update check that what it
> > sees in xmax matches what was in the tuple the update started with.
>
> Hmmm. Achieving it without an XID would be the real trick. (With an XID, we
> could use xl_heap_lock like heap_update() does.) Thinking out loud, what if
> heap_inplace_update() sets HEAP_XMAX_INVALID and xmax =
> TransactionIdAdvance(xmax)? Or change t_ctid in a similar way. Then regular
> heap_update() could complain if the field changed vs. last seen value. This
> feels like something to regret later in terms of limiting our ability to
> harness those fields for more-valuable ends or compact them away in a future
> page format. I can't pinpoint a specific loss, so the idea might have legs.
> Nontransactional data in separate tables or in new metapages smells like the
> right long-term state. A project wanting to reuse the tuple header bits could
> introduce such storage to unblock its own bit reuse.

heap_update() does not have the pre-modification xmax today, so I used t_ctid.
heap_modify_tuple() preserves t_ctid, so heap_update() already has the
pre-modification t_ctid in key cases. For details of how the prototype uses
t_ctid, see comment at "#define InplaceCanaryOffsetNumber". The prototype
doesn't prevent corruption in the following scenario, because the aborted
ALTER TABLE RENAME overwrites the special t_ctid:

== session 1
drop table t;
create table t (c int);
begin;
-- in gdb, set breakpoint on heap_modify_tuple
grant select on t to public;

== session 2
alter table t add primary key (c);
begin; alter table t rename to t2; rollback;

== back in session 1
-- release breakpoint
-- want error (would get it w/o the begin;alter;rollback)
commit;
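To make the failure mode above concrete, here is a toy Python model of my reading of the canary scheme (not PostgreSQL code; all names and the exact check are invented stand-ins for the prototype's t_ctid handling). The point it illustrates: heap_update() after a rolled-back heap_update() can no longer see the canary, because abort processing never restores the old tuple's t_ctid.

```python
SELF = ("blk0", 1)            # a live tuple's ctid normally points at itself
CANARY = "inplace-canary"     # stand-in for the special InplaceCanaryOffsetNumber

def heap_inplace_update(tup):
    tup["t_ctid"] = CANARY    # leave a mark for later transactional updaters

def heap_update(tup, remembered_ctid, new_tid):
    # Complain only when the canary is present but the caller's copy of the
    # tuple predates it, i.e. the caller missed an inplace update.
    if tup["t_ctid"] == CANARY and remembered_ctid != CANARY:
        raise RuntimeError("tuple concurrently updated")
    tup["t_ctid"] = new_tid   # points at the successor version; this is
                              # NOT undone if the transaction later aborts

def scenario(with_aborted_rename):
    tup = {"t_ctid": SELF}
    grant_saw = tup["t_ctid"]            # session 1: GRANT fetches the row
    heap_inplace_update(tup)             # session 2: ADD PRIMARY KEY (inplace)
    if with_aborted_rename:
        # session 2: RENAME updates the row (clobbering the canary), then
        # rolls back; heap abort processing leaves t_ctid as-is
        heap_update(tup, tup["t_ctid"], ("blk0", 2))
    try:
        heap_update(tup, grant_saw, ("blk0", 3))   # session 1: commit
        return "missed"                  # corruption scenario: no error
    except RuntimeError:
        return "detected"

print(scenario(False), scenario(True))   # -> detected missed
```

Without the rolled-back rename the canary survives and GRANT gets the wanted error; with it, the overwritten t_ctid looks like any ordinary update chain and detection fails, matching the transcript above.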

I'm missing how to mark the tuple in a fashion accessible to a second
heap_update() after a rolled-back heap_update(). The mark needs enough bits
"N" so it's implausible for 2^N inplace updates to happen between GRANT
fetching the old tuple and GRANT completing heap_update(). Looking for bits
that could persist across a rolled-back heap_update(), we have 3 in t_ctid, 2
in t_infomask2, and 0 in xmax. I definitely don't want to paint us into a
corner by spending the t_infomask2 bits on this. Even if I did, 2^(3+2)=32
wouldn't clearly be enough inplace updates.
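The wraparound hazard with so few bits can be sketched in a couple of lines (a toy illustration, not PostgreSQL code): a 5-bit mark cycles after 2^5 inplace updates, so a GRANT stalled across a full cycle sees the very value it remembered and misses every intervening update.

```python
N_BITS = 5                  # 3 usable bits in t_ctid + 2 in t_infomask2
MODULUS = 2 ** N_BITS       # only 32 distinct mark values

mark = 7                    # value GRANT saw when it fetched the tuple
remembered = mark

for _ in range(MODULUS):    # 32 inplace updates while GRANT is stalled
    mark = (mark + 1) % MODULUS

print(mark == remembered)   # -> True: the wrapped counter looks untouched
```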

Is there a way to salvage the goal of fixing the bug without modifying code
like ExecGrant_common()? If not, I'm inclined to pick from one of the
following designs:

- Acquire ShareUpdateExclusiveLock in GRANT ((B1) from previous list). It
does make GRANT more intrusive; e.g. GRANT will cancel autovacuum. I'm
leaning toward this one for two reasons. First, it doesn't slow
heap_update() outside of assert builds. Second, it makes the GRANT
experience more like the rest of our DDL, in that concurrent DDL will make
GRANT block, not fail.

- GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
heap_update() compares those bytes to the oldtup in shared buffers to see if
an inplace update happened. (HEAD could get the bytes from a new
heap_update() parameter, while back branches would need a different passing
approach.)

- LockTuple(), as seen in its attached prototype. I like this least at the
moment, because it changes pg_locks content without having a clear advantage
over the previous option. Also, the prototype has enough FIXME markers that
I expect this to get hairy before it's done.
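For the second design, a rough model of the comparison (invented field names, not the real HeapTupleHeaderData layout): GRANT hands heap_update() the fixed-size bytes it originally fetched, and heap_update() compares them against the current bytes in shared buffers, so any intervening inplace update (say, relhasindex flipped by ADD PRIMARY KEY) makes the check fail.

```python
import struct

def pack_fixed(relhasindex, relfrozenxid):
    # stand-in for the fixed-size portion of a pg_class tuple
    return struct.pack("<?I", relhasindex, relfrozenxid)

def heap_update(buffer_bytes, premod_bytes):
    # compare the caller's pre-modification image with shared buffers
    if buffer_bytes != premod_bytes:
        raise RuntimeError("tuple concurrently updated")
    # ... proceed with the transactional update ...

premod = pack_fixed(False, 700)   # what GRANT fetched before the breakpoint
shared = pack_fixed(True, 700)    # after ADD PRIMARY KEY's inplace update

try:
    heap_update(shared, premod)
    result = "missed"
except RuntimeError:
    result = "detected"
print(result)   # -> detected
```

Unlike the few-bit canary, a full-image compare cannot wrap, though it does require plumbing the pre-modification bytes down to heapam (the new-parameter question noted above).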

I might change my preference after further prototypes. Does anyone have a
strong preference between those? Briefly, I did consider these additional
alternatives:

- Just accept the yet-rarer chance of corruption from this message's test
procedure.

- Hold a buffer lock long enough to solve things.

- Remember the tuples where we overwrote a special t_ctid, and reverse the
overwrite during abort processing. But I/O in the abort path sounds bad.

Thanks,
nm

Attachment Content-Type Size
intra-grant-inplace-via-ctid-v0.patch text/plain 7.7 KB
intra-grant-inplace-via-tuplock-v0.patch text/plain 10.2 KB
