Re: BUG #17821: Assertion failed in heap_update() due to heap pruning

From: Noah Misch <noah(at)leadboat(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17821: Assertion failed in heap_update() due to heap pruning
Date: 2024-11-13 00:26:57
Message-ID: 20241113002657.ce@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Apr 09, 2024 at 10:00:00PM +0300, Alexander Lakhin wrote:
> 09.04.2024 20:41, Melanie Plageman wrote:
> > Just to confirm, the original bug was filed against 15, but I see
> > ad98fb142 was committed on master as well as backpatched. You are able
> > to reproduce your pg_subscription example on master? I tried your
> > repro (including the modification to minfree) and wasn't able to
> > reproduce on master on my machine. This could totally be my fault, but
> > I just wanted to confirm you were able to repro on master.
>
> Yes, I reproduced both cases on master (27074bce0).

This looked a lot like the former LP_UNUSED race conditions that inplace.spec
demonstrated. I forked inplace.spec to use regular MVCC updates (attached).
That reproduces the symptoms you describe. This happens because syscache is a
way to get a tuple without a pin on the tuple's page and without satisfying
any snapshot. Either a pin or a snapshot would stop the pruning. That's how
regular UPDATE statements avoid this.

Some options for fixing it:

1. Make the ItemIdIsNormal() check a runtime check that causes heap_update()
to return TM_Deleted. It could Assert(RelationSupportsSysCache(rel)), too.

2. Like the previous, but introduce a TM_Pruned. If pruning hadn't happened,
heap_update() would have returned TM_Updated or TM_Deleted. We don't know
which one heap_update() would have returned, because pruning erased the
t_ctid that would have driven that decision.

3. Before using a syscache tuple for a heap_update(), pin its buffer and
compare the syscache xmin to the buffer tuple xmin. Normally, they'll
match. If they don't match, ERROR or re-fetch the latest tuple as though
the syscache lookup had experienced a cache miss.

4. Take a self-exclusive heavyweight lock before we use the syscache to fetch
the old tuple for a heap_update(). That's how we avoid this if the only
catalog-modifying commands are ALTER TABLE.

I think this defect can't corrupt catalogs or other data. It's merely a
source of transient failure. While one of the test permutations does update a
pg_class tuple of a different rel, the update then fails with a primary key
violation. All misdirected updates will fail that way, because we don't
update the primary key columns of syscache relations. With assertions
disabled, I think heap_update() looks for an old tuple at lp_off=0, hence
treating the page header as a tuple header. Bytes of pd_lsn become the
putative xmin. The attached test gets "ERROR: attempted to update invisible
tuple" error from that, but other PageHeader bit patterns might reach "could
not access status of transaction", unique constraint violations, etc.

I recommend (2) or (1), since they're self-contained. (3) or (4) potentially
give a better user experience for this rare failure, but that's code churn at
every DDL heap_update() site. How do you see it?

Thanks,
nm

Attachment Content-Type Size
repro-ItemIdIsNormal-v1.patch text/plain 10.1 KB

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Jeff Davis 2024-11-13 07:18:18 Re: HashAgg degenerate case
Previous Message Jeff Davis 2024-11-12 23:15:55 Re: HashAgg degenerate case