Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date: 2021-11-11 03:08:21
Message-ID: 20211111030821.2xxl75qfkynaga4j@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2021-11-09 19:07:02 -0800, Peter Geoghegan wrote:
> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index c7331d810..31fa4d2a3 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -31,6 +31,7 @@
> typedef struct
> {
> Relation rel;
> + BlockNumber targetblkno;

It's a good optimization to remove the redundant BufferGetBlockNumber() calls,
but perhaps it's worth committing that separately?
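Just to illustrate what that separate commit could look like (untested sketch,
reusing the field name from your patch): cache the block number once while
setting up the PruneState, and use that wherever we currently call
BufferGetBlockNumber() per tuple:

    /* in heap_page_prune(), while initializing prstate */
    prstate.rel = relation;
    prstate.targetblkno = BufferGetBlockNumber(buffer);

    /* ... and later, wherever a HeapTupleData is built for visibility checks */
    ItemPointerSet(&(tup.t_self), prstate->targetblkno, offnum);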

> @@ -256,10 +260,8 @@ heap_page_prune(Relation relation, Buffer buffer,
> offnum <= maxoff;
> offnum = OffsetNumberNext(offnum))
> {
> - ItemId itemid;
> -
> /* Ignore items already processed as part of an earlier chain */
> - if (prstate.marked[offnum])
> + if (prstate.fromvalidchain[offnum])
> continue;
>
> /*

ISTM that we should now call heap_prune_chain() only for the start of actual
chains (i.e. redirect or non-HOT tuple).

I think it'd be good to have a comment above the loop explaining that we're
just iterating over chains starting at a non-HOT element.
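Roughly what I have in mind, purely as an untested sketch (names as in your
patch / the existing code):

    /*
     * Scan the page, processing HOT chains.  Only start a traversal at items
     * that can be the start of a chain, i.e. redirect line pointers and
     * normal, non-heap-only tuples.  Heap-only tuples belonging to a valid
     * chain are processed as members of that chain; disconnected ones are
     * handled in the second loop below.
     */
    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsNormal(itemid))
        {
            HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);

            /* heap-only tuples are never the start of a chain */
            if (HeapTupleHeaderIsHeapOnly(htup))
                continue;
        }
        else if (!ItemIdIsRedirected(itemid))
            continue;        /* LP_UNUSED / LP_DEAD can't start a chain */

        if (off_loc)
            *off_loc = offnum;

        ndeleted += heap_prune_chain(buffer, offnum, &prstate);
    }

I think that'd also make the fromvalidchain check in this first loop
unnecessary, since a chain start can't be reached from another chain.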

> @@ -269,15 +271,29 @@ heap_page_prune(Relation relation, Buffer buffer,
> if (off_loc)
> *off_loc = offnum;
>
> - /* Nothing to do if slot is empty or already dead */
> - itemid = PageGetItemId(page, offnum);
> - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> - continue;
> -
> /* Process this item or chain of items */
> ndeleted += heap_prune_chain(buffer, offnum, &prstate);
> }

Why did you move this into heap_prune_chain()?

> + /*
> + * Scan the page again, processing any heap-only tuples that we now know
> + * are not part of any valid HOT chain
> + */
> + for (offnum = FirstOffsetNumber;
> + offnum <= maxoff;
> + offnum = OffsetNumberNext(offnum))
> + {
> + /* Ignore items already processed as part of a known valid chain */
> + if (prstate.fromvalidchain[offnum])
> + continue;
> +
> + if (off_loc)
> + *off_loc = offnum;
> +
> + /* Process this disconnected heap-only tuple */
> + ndeleted += heap_prune_heaponly(buffer, offnum, &prstate);
> + }
> +
> /* Clear the offset information once we have processed the given page. */
> if (off_loc)
> *off_loc = InvalidOffsetNumber;
> @@ -383,19 +399,8 @@ heap_page_prune(Relation relation, Buffer buffer,
> pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
>
> /*
> - * XXX Should we update the FSM information of this page ?
> - *
> - * There are two schools of thought here. We may not want to update FSM
> - * information so that the page is not used for unrelated UPDATEs/INSERTs
> - * and any free space in this page will remain available for further
> - * UPDATEs in *this* page, thus improving chances for doing HOT updates.
> - *
> - * But for a large table and where a page does not receive further UPDATEs
> - * for a long time, we might waste this space by not updating the FSM
> - * information. The relation may get extended and fragmented further.
> - *
> - * One possibility is to leave "fillfactor" worth of space in this page
> - * and update FSM with the remaining space.
> + * Don't update the FSM information on this page. We leave it up to our
> + * caller to decide what to do about that.
> */

I don't think this should be part of the commit?

> offnum = rootoffnum;
> + nchain = 0;

> /* while not end of the chain */
> for (;;)
> {
> ItemId lp;
> - bool tupdead,
> - recent_dead;
> + HeapTupleHeader htup;
> + HeapTupleData tup;
> + bool tupdead;
>
> /* Sanity check (pure paranoia) */
> if (offnum < FirstOffsetNumber)
> @@ -601,15 +603,11 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> break;
>
> /* If item is already processed, stop --- it must not be same chain */
> - if (prstate->marked[offnum])
> + if (nchain != 0 && prstate->fromvalidchain[offnum])
> break;

I think we should make it an error to reach the same tuple multiple ways. It's
not *quite* as trivial as making this an assert/error though, as we can only
raise an error once the xmin/xmax continuity check has passed.

> /*
> * If we are looking at the redirected root line pointer, jump to the
> * first normal tuple in the chain. If we find a redirect somewhere
> @@ -619,42 +617,63 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> {
> if (nchain > 0)
> break; /* not at start of chain */
> + Assert(rootisredirect);
> chainitems[nchain++] = offnum;
> offnum = ItemIdGetRedirect(rootlp);
> continue;
> }

Why are we handling this inside the loop?
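I'd have expected the redirect to be dealt with once, before entering the loop
(untested sketch, reusing rootlp / rootisredirect from your patch):

    if (ItemIdIsRedirected(rootlp))
    {
        Assert(rootisredirect);
        chainitems[nchain++] = rootoffnum;
        offnum = ItemIdGetRedirect(rootlp);
    }
    else
        offnum = rootoffnum;

    /* while not end of the chain */
    for (;;)
    {
        ...
    }

Then a redirect encountered inside the loop can simply terminate the chain
unconditionally.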

> - /*
> - * Likewise, a dead line pointer can't be part of the chain. (We
> - * already eliminated the case of dead root tuple outside this
> - * function.)
> - */
> - if (ItemIdIsDead(lp))
> + /* LP_UNUSED or LP_DEAD items obviously not part of the chain */
> + if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
> + {
> + /*
> + * XXX What if we just came from root item, which is a plain heap
> + * tuple? Do we need assert that notices when we reach an LP_DEAD
> + * or LP_UNUSED item having started from such a root item?
> + */
> + Assert(!rootisredirect || nchain > 1);

I don't think we can assert that. It's perfectly possible to have a non-HOT
update chain where the following element can be vacuumed, but the preceding
element can't (e.g. when the updater has an older xid that prevents it from
being pruned).

> chainitems[nchain++] = offnum;
> + prstate->fromvalidchain[offnum] = true;
>
> /*
> * Check tuple's visibility status.
> */
> - tupdead = recent_dead = false;
> + tupdead = false;
>
> switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
> {
> @@ -663,7 +682,6 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> break;
>
> case HEAPTUPLE_RECENTLY_DEAD:
> - recent_dead = true;
>
> /*
> * This tuple may soon become DEAD. Update the hint field so
> @@ -679,6 +697,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> * This tuple may soon become DEAD. Update the hint field so
> * that the page is reconsidered for pruning in future.
> */
> + Assert(!tupdead);
> heap_prune_record_prunable(prstate,
> HeapTupleHeaderGetUpdateXid(htup));
> break;

> @@ -692,6 +711,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> * But we don't. See related decisions about when to mark the
> * page prunable in heapam.c.
> */
> + Assert(!tupdead);
> break;
>
> default:

I don't understand these new asserts? We just set tupdead to false above?

> @@ -700,11 +720,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> }
>
> /*
> - * Remember the last DEAD tuple seen. We will advance past
> - * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
> - * but we can't advance past anything else. We have to make sure that
> - * we don't miss any DEAD tuples, since DEAD tuples that still have
> - * tuple storage after pruning will confuse VACUUM.
> + * Remember the last DEAD tuple seen from this HOT chain.
> + *
> + * We expect to sometimes find a RECENTLY_DEAD tuple after a DEAD
> + * tuple. When this happens, the RECENTLY_DEAD tuple will be treated
> + * as if it was DEAD all along. Manage that now.

I now actually wonder why this is correct. There's another comment about it:

> We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really

But that doesn't justify very much.

What prevents the scenario that some other backend, e.g., has a snapshot with
xmin = xmax = the RECENTLY_DEAD row? If the RECENTLY_DEAD row has an xid that
is later than the DEAD row's, this afaict would make it perfectly legal to
prune the DEAD row, but *not* the RECENTLY_DEAD one.

> + * We're not just interested in DEAD and RECENTLY_DEAD tuples. We
> + * need to traverse each and every HOT chain exhaustively, in order to
> + * determine which heap-only tuples are part of a valid HOT chain.
> + * Heap-only tuples that cannot be located like this must not be part
> + * of a valid HOT chain. They are therefore processed during our
> + * second pass over the page.
> */
> if (tupdead)
> {

This comment can't really be understood without knowing the historical
behaviour of the function.

> +static int
> +heap_prune_heaponly(Buffer buffer, OffsetNumber offnum, PruneState *prstate)
> +{
> + Page dp = (Page) BufferGetPage(buffer);
> + ItemId lp;
> + HeapTupleHeader htup;
> + HeapTupleData tup;
> + HTSV_Result res;
> +
> + lp = PageGetItemId(dp, offnum);
> + Assert(ItemIdIsNormal(lp));
> + htup = (HeapTupleHeader) PageGetItem(dp, lp);
> +
> + /*
> + * Caller must make sure that the tuple at 'offnum' is in fact a heap-only
> + * tuple that is disconnected from its HOT chain (i.e. isn't reachable
> + * through a HOT traversal that starts at any plausible-looking root item
> + * on the page).
> + */
> + Assert(!prstate->fromvalidchain[offnum]);
> + Assert(HeapTupleHeaderIsHeapOnly(htup));
> +
> + /*
> + * We expect that disconnected heap-only tuples must be from aborted
> + * transactions. Any RECENTLY_DEAD tuples we see here are really DEAD,
> + * but the heap_prune_satisfies_vacuum test is too coarse to detect it.
> + */
> + tup.t_len = ItemIdGetLength(lp);
> + tup.t_tableOid = RelationGetRelid(prstate->rel);
> + tup.t_data = htup;
> + ItemPointerSet(&(tup.t_self), prstate->targetblkno, offnum);
> + res = heap_prune_satisfies_vacuum(prstate, &tup, buffer);
> + if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
> + {
> + heap_prune_record_unused(prstate, offnum);
> + HeapTupleHeaderAdvanceLatestRemovedXid(htup,
> + &prstate->latestRemovedXid);
> + }
> + else
> + Assert(false);
> +
> + /*
> + * Should always be DEAD. A DEAD heap-only tuple is always counted in
> + * top-level ndeleted counter for pruning operation.
> + */
> + return 1;
> +}

It seems weird to have the if (res == HEAPTUPLE_DEAD ..) branch, but then to
unconditionally return 1.

I'm not actually sure the Assert is unreachable. I can imagine cases where
we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent subtransaction
abort or such.
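To tie the return value to what actually happened, I'd have expected something
more along these lines (untested sketch; the elog message is just illustrative,
and if the non-dead case turns out to be reachable we'd presumably have to
return 0 there instead of erroring out):

    res = heap_prune_satisfies_vacuum(prstate, &tup, buffer);

    if (res != HEAPTUPLE_DEAD && res != HEAPTUPLE_RECENTLY_DEAD)
        elog(ERROR, "disconnected heap-only tuple at (%u,%u) is not dead",
             prstate->targetblkno, offnum);

    heap_prune_record_unused(prstate, offnum);
    HeapTupleHeaderAdvanceLatestRemovedXid(htup,
                                           &prstate->latestRemovedXid);

    return 1;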

Greetings,

Andres Freund
