From: Noah Misch <noah(at)leadboat(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17821: Assertion failed in heap_update() due to heap pruning
Date: 2025-01-23 02:13:49
Message-ID: 20250123021349.b0.nmisch@google.com
Lists: pgsql-bugs
On Sat, Jan 11, 2025 at 01:44:54PM -0800, Noah Misch wrote:
> On Tue, Nov 12, 2024 at 04:26:57PM -0800, Noah Misch wrote:
> > 1. Make the ItemIdIsNormal() check a runtime check that causes heap_update()
> > to return TM_Deleted. It could Assert(RelationSupportsSysCache(rel)), too.
> A UNIQUE constraint violation is too late to avoid page header corruption.
> When the page is next read from disk, it gets "ERROR: invalid page".
> I plan to write (1).
Attached. Outside of comments and tests, it's a simple 20-line patch. Since
only two weeks will remain before release freeze, in the absence of review, I
would push this after 2025-01-25T16:00+0000. Unfortunately, while this
prevents the page header corruption, there's a preexisting loss of DDL changes
that it doesn't fix. I'll start a new thread about that, but I've included a
preview in Appendix 1 below. Would anyone else like to work on the problem in
Appendix 1? If so, let me know and feel free to start that thread yourself.
I may not get to it in 2025.
The new code sets TM_FailureData.xmax to invalid, which TM_Deleted may have
never done before. I see no problem at sites using that xmax:
src/backend/executor/nodeModifyTable.c:3137: if (TransactionIdIsCurrentTransactionId(context->tmfd.xmax))
src/backend/executor/nodeModifyTable.c:3307: if (TransactionIdIsCurrentTransactionId(context->tmfd.xmax))
src/backend/access/heap/heapam_handler.c:395: priorXmax = tmfd->xmax;
src/backend/access/heap/heapam_handler.c:488: tmfd->xmax = priorXmax;
The first two will rightly compute
TransactionIdIsCurrentTransactionId()==false. The second two are dealing with
tmfd from heap_lock_tuple(), not from heap_update(). Those grep matches are
from v18; v13 and PGXN code present no additional challenges.
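For reference, a minimal sketch of why the invalid xmax is harmless at those
first two call sites. The helper name below is made up; the test it performs
mirrors the initial guard in TransactionIdIsCurrentTransactionId() (xact.c),
which returns false for any non-normal XID before looking at the current
transaction, and InvalidTransactionId is not normal:

    #include "postgres.h"

    #include "access/transam.h"

    /*
     * Sketch only: mirrors the first check inside
     * TransactionIdIsCurrentTransactionId(), which rejects non-normal XIDs
     * (including InvalidTransactionId) before any further lookup.
     */
    static bool
    xid_rejected_before_lookup(TransactionId xid)
    {
        /* InvalidTransactionId is not normal, so it can never be "mine". */
        return !TransactionIdIsNormal(xid);
    }

So both nodeModifyTable.c sites see "not my transaction" and take their
ordinary concurrent-deletion path.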
The test was sensitive at first to debug_parallel_query, debug_discard_caches,
autovacuum, and -DCATCACHE_FORCE_RELEASE. A GUC controlling
-DCATCACHE_FORCE_RELEASE, the way debug_discard_caches controls
CLOBBER_CACHE_ALWAYS, would have avoided the need for an alternative expected
output. I'll not be
shocked if the buildfarm finds another way to make the test unstable. We may
end up disabling autovacuum for src/test/modules/injection_points, since
autovacuum doesn't mix with non-local injection points that autovacuum can
reach. For now, I've aimed to preserve compatibility with autovacuum=on.
==== Appendix 1: the unfixed loss of DDL changes
Unfortunately, I found a related case the patch doesn't fix. The known-to-me
fixes for this other case are more invasive than the fix for $SUBJECT, so I
plan to move forward with the $SUBJECT fix and bring the following to a new
thread. The third & last permutation of the $SUBJECT patch test case covers
the unfixed case. In brief, suppose the oldtup for a catalog update is a
syscache entry with xmin=1234 ctid=(11,38). We call CatalogTupleUpdate(rel,
'(11,38)', newtup). Meanwhile, the tuple has been updated twice with a VACUUM
between the two updates, so (11,38) now contains an xmin=5678 tuple for the
same primary key. CatalogTupleUpdate() has no way to know that its caller
used a stale oldtup, so the effect is to discard the two intervening updates.
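As a concrete illustration, here is a minimal sketch (not code from the patch;
the function name is invented, and the cache id and values/nulls/replaces
arrays stand in for whatever column change the caller wants) of the ordinary
syscache-then-CatalogTupleUpdate() pattern that silently discards the
intervening updates when oldtup has gone stale:

    #include "postgres.h"

    #include "access/htup_details.h"
    #include "catalog/indexing.h"
    #include "utils/rel.h"
    #include "utils/syscache.h"

    static void
    update_catalog_row_sketch(Relation rel, Oid reloid,
                              Datum *values, bool *nulls, bool *replaces)
    {
        HeapTuple   oldtup;
        HeapTuple   newtup;

        /* oldtup comes from the syscache; say xmin=1234, ctid=(11,38) */
        oldtup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
        if (!HeapTupleIsValid(oldtup))
            elog(ERROR, "cache lookup failed for relation %u", reloid);

        newtup = heap_modify_tuple(oldtup, RelationGetDescr(rel),
                                   values, nulls, replaces);

        /*
         * If two concurrent updates with a VACUUM between them have reused
         * (11,38) for an xmin=5678 tuple of the same primary key,
         * CatalogTupleUpdate() has no way to notice: it updates whatever
         * (11,38) holds now, discarding the two intervening updates.
         */
        CatalogTupleUpdate(rel, &newtup->t_self, newtup);

        heap_freetuple(newtup);
        heap_freetuple(oldtup);
    }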
To fix that, I lean toward the following:
- Make heap_modify_tuple() copy xmin from the oldtup.
- Make simple_heap_update() assert that newtup contains a valid xmin.
- In heap_update(), if newtup contains a valid xmin and otid now points to a
tuple with a different xmin, fail with TM_Deleted; a sketch of that check
follows this list. (Invalid xmin is also okay for heap_update() callers that
don't route through simple_heap_update(), e.g. ExecUpdate(). ExecUpdate() has
a snapshot preventing the loss.)
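A minimal sketch of that third point, assuming a hypothetical helper that
heap_update() would call after locating the tuple at otid; none of this is
from the patch itself:

    #include "postgres.h"

    #include "access/htup_details.h"
    #include "access/transam.h"

    /*
     * Hypothetical helper: decide whether the caller's oldtup has gone
     * stale, given newtup (which carries the oldtup xmin per the first
     * point above) and the header of the tuple currently stored at otid.
     * heap_update() would map "true" to TM_Deleted.
     */
    static bool
    newtup_xmin_is_stale(HeapTuple newtup, HeapTupleHeader current)
    {
        TransactionId expected = HeapTupleHeaderGetXmin(newtup->t_data);

        /*
         * Callers that don't go through simple_heap_update(), such as
         * ExecUpdate(), may pass an invalid xmin; their snapshot already
         * prevents the loss, so skip the check for them.
         */
        if (!TransactionIdIsValid(expected))
            return false;

        /* A different xmin at otid means the caller's old tuple is gone. */
        return !TransactionIdEquals(expected, HeapTupleHeaderGetXmin(current));
    }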
Catalog updates not based on heap_copytuple() or heap_modify_tuple() would
need changes. They'll keep working in non-assert builds if they pass invalid
xmin. They'll break if they pass garbage bytes in xmin.
Better ideas? Other ways the lack of a snapshot protecting oldtup might break?
Two of my upthread alternatives would also fix this, but they both entail the
churn of changing every CatalogTupleUpdate() caller (6 PGXN modules and ~200
postgresql.git calls):
> > 3. Before using a syscache tuple for a heap_update(), pin its buffer and
> > compare the syscache xmin to the buffer tuple xmin. Normally, they'll
> > match. If they don't match, ERROR or re-fetch the latest tuple as though
> > the syscache lookup had experienced a cache miss.
> > 4. Take a self-exclusive heavyweight lock before we use the syscache to fetch
> > the old tuple for a heap_update(). That's how we avoid this if the only
> > catalog-modifying commands are ALTER TABLE.
==== Appendix 2: UNIQUE violation helps less than theorized
The repro-ItemIdIsNormal-v2.patch test case (see the previous message of this
thread) didn't get the UNIQUE violation I expected. I opted to find out why
not, because I thought at the time that we would be relying long-term on the
UNIQUE violation when a different live tuple has replaced the ctid we're
updating.
The problem update makes index entries for its new tuple, but this code stops
us from reporting the violation:
         * Otherwise we have a definite conflict.  But before
         * complaining, look to see if the tuple we want to insert
         * is itself now committed dead --- if so, don't complain.
         * This is a waste of time in normal scenarios but we must
         * do it to support CREATE INDEX CONCURRENTLY.
    ...
        if (table_index_fetch_tuple_check(heapRel, &htid,
                                          SnapshotSelf, NULL))
Since the page header corruption sets pd_lower=0, PageGetMaxOffsetNumber()
concludes the page has no tuples, so table_index_fetch_tuple_check() finds
nothing.
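For reference, a paraphrase (not a verbatim excerpt; the in-tree form varies
by branch) of the pd_lower arithmetic behind PageGetMaxOffsetNumber() in
storage/bufpage.h, showing why pd_lower=0 reads as an empty page:

    #include "postgres.h"

    #include "storage/bufpage.h"
    #include "storage/itemid.h"
    #include "storage/off.h"

    /*
     * Sketch of the pd_lower arithmetic: with the corrupted header's
     * pd_lower=0, the first branch wins and the page appears empty.
     */
    static OffsetNumber
    page_max_offset_sketch(Page page)
    {
        PageHeader  phdr = (PageHeader) page;

        if (phdr->pd_lower <= SizeOfPageHeaderData)
            return 0;           /* pd_lower=0 lands here */

        return (phdr->pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData);
    }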
Attachments:
update-unused-lp-10-no-installcheck-v1.patch (text/plain, 1.2 KB)
update-unused-lp-40-xid2fxid-v1.patch (text/plain, 10.1 KB)
update-unused-lp-50-fix-v1.patch (text/plain, 24.8 KB)