Re: race condition in pg_class

From: Noah Misch <noah@leadboat.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Smolkin Grigory <smallkeen@gmail.com>, pgsql-hackers@postgresql.org
Subject: Re: race condition in pg_class
Date: 2024-05-12 23:29:23
Message-ID: 20240512232923.aa.nmisch@google.com
Lists: pgsql-hackers

I'm attaching patches implementing the LockTuple() design. It turns out we
don't just lose inplace updates; we also overwrite unrelated tuples, as
reproduced in inplace.spec. Good starting points are README.tuplock and the
heap_inplace_update_scan() header comment.
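
For orientation, the caller pattern looks roughly like this (simplified;
the argument list here is a placeholder, and the real signatures and rules
are in inplace110-successors-v1.patch and the header comments):

    /* Simplified sketch, not the exact API. */
    void       *state;
    HeapTuple   oldtup;

    /*
     * Find the tuple and take LOCKTAG_TUPLE, so no concurrent
     * heap_update() can replace it underneath us.
     */
    heap_inplace_update_scan(relation, /* keys identifying the row */ ...,
                             &oldtup, &state);
    if (HeapTupleIsValid(oldtup))
    {
        if (apply_caller_changes(oldtup))   /* placeholder for the caller's
                                             * mutation of its copy */
            heap_inplace_update_finish(state, oldtup);  /* WAL+memcpy, unlock */
        else
            heap_inplace_update_cancel(state);  /* nothing to change; unlock */
    }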

On Wed, Nov 01, 2023 at 08:09:15PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> > On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > > We could perhaps make this work by using the existing tuple-lock
> > > infrastructure, rather than inventing new locktags (a choice that
> > > spills to a lot of places including clients that examine pg_locks).

> > > I'd also like to find a solution that fixes the case of a conflicting
> > > manual UPDATE (although certainly that's a stretch goal we may not be
> > > able to reach).

I implemented that; search for ri_needLockTagTuple.
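
In outline, the executor side of that is about the following (simplified;
details are in inplace120-locktag-v1.patch):

    /* Set once per result relation; ordinary tables pay a boolean test. */
    resultRelInfo->ri_needLockTagTuple =
        IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc);

    /*
     * At update time, bracket the update in a tuple lock, so a manual
     * UPDATE of pg_class waits for any inplace updater and vice versa.
     */
    if (resultRelInfo->ri_needLockTagTuple)
        LockTuple(rel, tupleid, InplaceUpdateTupleLock);
    result = table_tuple_update(rel, tupleid, slot, ...);
    if (resultRelInfo->ri_needLockTagTuple)
        UnlockTuple(rel, tupleid, InplaceUpdateTupleLock);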

> - GRANT passes to heapam the fixed-size portion of the pre-modification tuple.
> heap_update() compares those bytes to the oldtup in shared buffers to see if
> an inplace update happened. (HEAD could get the bytes from a new
> heap_update() parameter, while back branches would need a different passing
> approach.)

This could have been fine, but ...

> - LockTuple(), as seen in its attached prototype. I like this least at the
> moment, because it changes pg_locks content without having a clear advantage
> over the previous option.

... I settled on the LockTuple() design for these reasons:

- Solves more conflicts by waiting, instead of by ERROR or by retry loops.
- Extensions wanting inplace updates aren't at a big disadvantage relative to
core code's inplace updates.
- One could use this to stop "tuple concurrently updated" for pg_class rows,
by using SearchSysCacheLocked1() for all pg_class DDL and making that
function wait for any existing xmax like inplace_xmax_lock() does. I don't
expect to write that, but it's a nice option to have; a hypothetical sketch
follows this list.
- pg_locks shows the new lock acquisitions.
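
If someone did pursue that pg_class idea, the DDL side might look about like
this (hypothetical; nothing in the attached patches does this):

    /*
     * Hypothetical: a regular (non-inplace) pg_class update done under the
     * same tuple lock that inplace updaters take.
     */
    HeapTuple   oldtup = SearchSysCacheLocked1(RELOID,
                                               ObjectIdGetDatum(relid));
    /* ... copy oldtup, modify the copy to make newtup ... */
    CatalogTupleUpdate(pg_class_rel, &oldtup->t_self, newtup);
    UnlockTuple(pg_class_rel, &oldtup->t_self, InplaceUpdateTupleLock);
    ReleaseSysCache(oldtup);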

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK of
CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
still seen in inplace-inval.spec. CacheInvalidateRelmap() does this right.

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
section, but it is critical. We must not commit transactional DDL without
other backends receiving an inval. (When the inplace inval becomes
nontransactional, it will face the same threat.)

- Trouble is possible, I bet, if the system crashes between the inplace-update
memcpy() and XLogInsert(). See the new XXX comment below the memcpy().
Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
finally issuing memcpy() into the buffer; see the sketch after this list.

- [consequences limited to transient failure] Since a PROC_IN_VACUUM backend's
xmin does not stop pruning, an MVCC scan in that backend can find zero
tuples when one is live. This is like what all backends got in the days of
SnapshotNow catalog scans. See the pgbench test suite addition. (Perhaps
the fix is to make VACUUM do its MVCC scans outside of PROC_IN_VACUUM,
setting that flag later and unsetting it earlier.)
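
Sketch of that DELAY_CHKPT idea (ordering only; record construction and
error handling elided, and the variable names are illustrative):

    /* Not implemented; shows only the proposed ordering. */
    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;       /* pin redo point */
    recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE); /* WAL first ... */
    memcpy(dst, src, newlen);                           /* ... then buffer */
    MarkBufferDirty(buffer);
    PageSetLSN(BufferGetPage(buffer), recptr);
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    END_CRIT_SECTION();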

If you find that decisions in this thread's patches are tied to any of those,
such that I should not separate them, let's discuss that. Topics in the
patches that I feel are most fruitful for debate:

- This makes inplace update block if the tuple has an updater. It's like one
GRANT blocking another, except an inplace updater won't get "ERROR: tuple
concurrently updated" like one of the GRANTs would. I had implemented
versions that avoided this blocking by mutating each tuple in the updated
tuple chain. That worked, but it had corner cases bad for maintainability,
listed in the inplace_xmax_lock() header comment. I'd rather accept the
blocking, so hackers can rule out those corner cases. A long-running GRANT
already hurts VACUUM progress more just by keeping an XID running than it
would by blocking an inplace update.

- Pre-checks could make heap_inplace_update_cancel() calls rarer. Avoiding
one of those avoids an exclusive buffer lock, and it avoids waiting on any
concurrent heap_update(). We'd pre-check the syscache tuple.
EventTriggerOnLogin() does it that way, because the code was already in that
form. I expect only vac_update_datfrozenxid() concludes !dirty often enough
to matter. I didn't bother with the optimization, but it would be simple;
a sketch follows this list.

- If any Citus extension user feels like porting its heap_inplace_update()
call to this, I'd value hearing about your experience.

- I paid more than my usual attention to test coverage, considering how
invasive the patch stack is compared to most back-patched bug fixes.
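
Sketch of that pre-check, using vac_update_datfrozenxid() as the example
(illustrative only; the syscache copy can be stale, so this can only skip
work, never replace the locked recheck, and newFrozenXid/newMinMulti stand
in for the computed values):

    /* Cheap pre-check against the syscache copy, before any buffer lock. */
    HeapTuple   cachetup = SearchSysCache1(DATABASEOID,
                                           ObjectIdGetDatum(MyDatabaseId));
    Form_pg_database dbform = (Form_pg_database) GETSTRUCT(cachetup);
    bool        dirty = (dbform->datfrozenxid != newFrozenXid ||
                         dbform->datminmxid != newMinMulti);

    ReleaseSysCache(cachetup);
    if (!dirty)
        return;             /* skip buffer lock and any heap_update() wait */
    /* Otherwise proceed with heap_inplace_update_scan() as usual. */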

I've kept all the above topics brief; feel free to ask for more details.

Thanks,
nm

Attachment Content-Type Size
inplace005-UNEXPECTEDPASS-tap-meson-v1.patch text/plain 1.6 KB
inplace010-tests-v1.patch text/plain 21.5 KB
inplace040-waitfuncs-v1.patch text/plain 8.2 KB
inplace050-tests-inj-v1.patch text/plain 11.5 KB
inplace060-nodeModifyTable-comments-v1.patch text/plain 6.7 KB
inplace070-rel-locks-missing-v1.patch text/plain 2.8 KB
inplace080-catcache-detoast-inplace-stale-v1.patch text/plain 4.7 KB
inplace090-LOCKTAG_TUPLE-eoxact-v1.patch text/plain 1.2 KB
inplace110-successors-v1.patch text/plain 47.6 KB
inplace120-locktag-v1.patch text/plain 42.9 KB
