From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #17821: Assertion failed in heap_update() due to heap pruning |
Date: | 2025-04-02 17:04:36 |
Message-ID: | 20250402170436.4a.nmisch@google.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On Mon, Mar 03, 2025 at 07:34:42PM -0800, Noah Misch wrote:
> On Mon, Mar 03, 2025 at 05:07:10PM -0500, Andres Freund wrote:
> > I just fixed skink, the valgrind animal, so it runs not just the main
> > regression tests but all tests with a valgrind-ified postgres.
>
> Thanks.
>
> > Unfortunately,
> > the next run triggered a failure in the test added in this thread:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2025-03-03%2019%3A52%3A38&stg=injection_points-check
> >
> > diff -U3 /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out
> > --- /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/modules/injection_points/expected/syscache-update-pruned.out 2025-01-25 19:30:50.005386842 +0000
> > +++ /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/injection_points/isolation/results/syscache-update-pruned.out 2025-03-03 21:08:02.025314915 +0000
> > @@ -75,6 +75,7 @@
> > SELECT FROM injection_points_wakeup('heap_update-before-pin');
> > <waiting ...>
> > step grant1: <... completed>
> > +ERROR: tuple concurrently deleted
> > step wakegrant4: <... completed>
> > step inspect4:
> > SELECT relhastriggers, relhassubclass FROM pg_class
> > @@ -82,6 +83,6 @@
> >
> > relhastriggers|relhassubclass
> > --------------+--------------
> > -f |f
> > +t |t
> > (1 row)
>
> That isolation permutation tests an unfixed bug. Here, it's giving a result
> as though the bug were fixed. The suite passed the next two skink runs. I'd
> tend to defer debugging exactly what went wrong until the project to fix the
> bug under test. I could delete the permutation, or I could leave it awhile to
> see how high-probability this failure is. I'm inclined to leave it until it
> gets four failures, then delete the permutation.
Alexander Lakhin shared a recipe that reproduced it. I found permutation 2
also failed under that recipe, so removing permutation 3 got less attractive.
These tests need a concept of "wait until VACUUM definitely will prune a
particular tuple". My removable_cutoff() header comment made an argument for
having achieved that, but it had two gaps. The attached patch passes 1000
Valgrind iterations of the spec. Without the patch, ~3% of iterations failed.
Commit c2dc1a7 added DISABLE_PAGE_SKIPPING to stop buffer pins thwarting other
VACUUM tests. However, I suspect FREEZE is necessary and sufficient in all
these tests, since we need to coax this code to insist on the cleanup lock:
/*
* If we didn't get the cleanup lock, we can still collect LP_DEAD
* items in the dead_items area for later vacuuming, count live and
* recently dead tuples for vacuum logging, and determine if this
* block could later be truncated. If we encounter any xid/mxids that
* require advancing the relfrozenxid/relminxid, we'll have to wait
* for a cleanup lock and call lazy_scan_prune().
*/
if (!got_cleanup_lock &&
!lazy_scan_noprune(vacrel, buf, blkno, page, &has_lpdead_items))
{
/*
* lazy_scan_noprune could not do all required processing. Wait
* for a cleanup lock, and call lazy_scan_prune in the usual way.
*/
Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
got_cleanup_lock = true;
}
FREEZE causes the VACUUM to "require advancing the relfrozenxid/relminxid",
hence the effectiveness of FREEZE for this purpose. The code has moved
around, but I think the policy was the same at the time of that commit
c2dc1a7. Per the doc for DISABLE_PAGE_SKIPPING, it's for working around a
corrupt vm, not for pin contention. The spec still passes 1000 Valgrind
iterations if I remove DISABLE_PAGE_SKIPPING.
Attachment | Content-Type | Size |
---|---|---|
update-unused-lp-60-dpageskip-v1.patch | text/plain | 9.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | PG Bug reporting form | 2025-04-02 19:00:02 | BUG #18875: COPY BINARY tsvector FROM file leads to misaligned memory access |
Previous Message | Álvaro Herrera | 2025-04-01 14:43:34 | Re: BUG #18874: CVE-2024-4317 |