Quick Links

Re: BUG #17821: Assertion failed in heap_update() due to heap pruning

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc:	Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Re: BUG #17821: Assertion failed in heap_update() due to heap pruning
Date:	2024-11-13 00:26:57
Message-ID:	20241113002657.ce@rfd.leadboat.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On Tue, Apr 09, 2024 at 10:00:00PM +0300, Alexander Lakhin wrote:
> 09.04.2024 20:41, Melanie Plageman wrote:
> > Just to confirm, the original bug was filed against 15, but I see
> > ad98fb142 was committed on master as well as backpatched. You are able
> > to reproduce your pg_subscription example on master? I tried your
> > repro (including the modification to minfree) and wasn't able to
> > reproduce on master on my machine. This could totally be my fault, but
> > I just wanted to confirm you were able to repro on master.
>
> Yes, I reproduced both cases on master (27074bce0).

This looked a lot like the former LP_UNUSED race conditions that inplace.spec
demonstrated. I forked inplace.spec to use regular MVCC updates (attached).
That reproduces the symptoms you describe. This happens because syscache is a
way to get a tuple without a pin on the tuple's page and without satisfying
any snapshot. Either a pin or a snapshot would stop the pruning. That's how
regular UPDATE statements avoid this.

Some options for fixing it:

1. Make the ItemIdIsNormal() check a runtime check that causes heap_update()
to return TM_Deleted. It could Assert(RelationSupportsSysCache(rel)), too.

2. Like the previous, but introduce a TM_Pruned. If pruning hadn't happened,
heap_update() would have returned TM_Updated or TM_Deleted. We don't know
which one heap_update() would have returned, because pruning erased the
t_ctid that would have driven that decision.

3. Before using a syscache tuple for a heap_update(), pin its buffer and
compare the syscache xmin to the buffer tuple xmin. Normally, they'll
match. If they don't match, ERROR or re-fetch the latest tuple as though
the syscache lookup had experienced a cache miss.

4. Take a self-exclusive heavyweight lock before we use the syscache to fetch
the old tuple for a heap_update(). That's how we avoid this if the only
catalog-modifying commands are ALTER TABLE.

I think this defect can't corrupt catalogs or other data. It's merely a
source of transient failure. While one of the test permutations does update a
pg_class tuple of a different rel, the update then fails with a primary key
violation. All misdirected updates will fail that way, because we don't
update the primary key columns of syscache relations. With assertions
disabled, I think heap_update() looks for an old tuple at lp_off=0, hence
treating the page header as a tuple header. Bytes of pd_lsn become the
putative xmin. The attached test gets "ERROR: attempted to update invisible
tuple" error from that, but other PageHeader bit patterns might reach "could
not access status of transaction", unique constraint violations, etc.

I recommend (2) or (1), since they're self-contained. (3) or (4) potentially
give a better user experience for this rare failure, but that's code churn at
every DDL heap_update() site. How do you see it?

Thanks,
nm

Attachment	Content-Type	Size
repro-ItemIdIsNormal-v1.patch	text/plain	10.1 KB

In response to

Re: BUG #17821: Assertion failed in heap_update() due to heap pruning at 2024-04-09 19:00:00 from Alexander Lakhin

Responses

Re: BUG #17821: Assertion failed in heap_update() due to heap pruning at 2025-01-11 21:44:54 from Noah Misch

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Jeff Davis	2024-11-13 07:18:18	Re: HashAgg degenerate case
Previous Message	Jeff Davis	2024-11-12 23:15:55	Re: HashAgg degenerate case