Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date: 2021-11-12 00:58:49
Message-ID: CAH2-Wz=fUgig7oJjvNEVToFy-G=HWzsbq13zkePsVv2tuk9wgA@mail.gmail.com
Lists: pgsql-bugs

On Wed, Nov 10, 2021 at 7:08 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> It's a good optimization to remove the redundant BufferGetBlockNumber(), but
> perhaps it's worth committing that separately?

Ordinarily I would, but the intention here is to avoid new performance
regressions. Is it okay if I backpatch this performance optimization,
trivial though it may be? Or should I just do that on HEAD?

Attached is v3, which works through some of your feedback, though there
are still unresolved issues. I expect that we'll need to very
carefully review this bugfix, since it's a particularly tricky issue.
I expect to produce several more revisions.

> > @@ -256,10 +260,8 @@ heap_page_prune(Relation relation, Buffer buffer,
> > offnum <= maxoff;
> > offnum = OffsetNumberNext(offnum))
> > {
> > - ItemId itemid;
> > -
> > /* Ignore items already processed as part of an earlier chain */
> > - if (prstate.marked[offnum])
> > + if (prstate.fromvalidchain[offnum])
> > continue;
> >
> > /*
>
> ISTM that we should now call heap_prune_chain() only for the start of actual
> chains (i.e. redirect or non-hot tuple).

Seems easier to put the "heap-only tuple at rootoffnum, so it might be
disconnected -- not for me to process unless and until I find it
during a later call that uses this item's HOT chain's root offset
number" logic here. That's at least closer to what we have already.
Let me know if you still don't like it in v3, though. I might just
change it, but it's related to the fact that the array is now called
"visited" -- a question that I'll consider in the paragraph
immediately following this one.
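
To show what I mean, here's a minimal sketch of the top of
heap_prune_chain() in this scheme (not the v3 code verbatim -- the
"visited" array is per the discussion below, and details of PruneState
are simplified):

/*
 * Sketch only: skip a heap-only tuple when we arrive at it directly.
 * It will be reached (and marked visited) during a later call whose
 * rootoffnum is its HOT chain's root item -- or, if no chain ever
 * reaches it, it's disconnected, and the second pass deals with it.
 */
Page        page = BufferGetPage(buffer);
ItemId      rootlp = PageGetItemId(page, rootoffnum);

if (ItemIdIsNormal(rootlp))
{
    HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, rootlp);

    if (HeapTupleHeaderIsHeapOnly(htup))
        return 0;               /* not for me to process yet */
}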

> I think it'd be good to have a comment above the loop explaining that we're
> just iterating over chains starting at a non-HOT element.

You mean like heap_get_root_tuples() does it, with its own root_offsets array?

I have changed the array -- it's now "visited". It's no longer
specifically about HOT chains. It's actually more like "definitely not
disconnected heap-only tuple" items, but I didn't call it that for the
obvious reason. By the end of the first pass over the heap, all items
that are not heap-only tuples (as well as most heap-only tuples) will
be marked visited in v3.

The advantage of not explicitly making it about root-ness is that it
avoids relying on the already-squishy definition of "root item". We
want to visit an LP_DEAD or LP_UNUSED item in the first pass over the
page -- we must get them out of the way before the second pass begins
(we don't expect to have to deal with anything that isn't a
disconnected heap-only tuple at that point).

Another advantage of the "visited" array design is that it doesn't
matter if we happen to visit an LP_DEAD/LP_UNUSED item via a heap-only
tuple's bogus-but-valid t_ctid chain. Might as well just mark it
"visited" at that point, avoiding a second visit during the first heap
pass (in addition to the second pass). That micro-optimization may not
be worth much, but it feels kind of natural to me. What do you think?
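
To make the bookkeeping concrete, the chain-following loop could mark
items on first contact, something like this (hypothetical fragment, not
v3 verbatim):

for (;;)
{
    ItemId      lp = PageGetItemId(page, offnum);

    /*
     * Mark on first contact -- even an LP_DEAD/LP_UNUSED item that we
     * only reached through a bogus-but-valid t_ctid gets marked, so
     * the first-pass loop never starts over from it.
     */
    prstate->visited[offnum] = true;

    /* A dead or unused item terminates the chain */
    if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
        break;

    /* ... validate the tuple, record it in chainitems, follow t_ctid ... */
}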

> > @@ -269,15 +271,29 @@ heap_page_prune(Relation relation, Buffer buffer,
> > if (off_loc)
> > *off_loc = offnum;
> >
> > - /* Nothing to do if slot is empty or already dead */
> > - itemid = PageGetItemId(page, offnum);
> > - if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> > - continue;
> > -
> > /* Process this item or chain of items */
> > ndeleted += heap_prune_chain(buffer, offnum, &prstate);
> > }
>
> Why did you move this into heap_prune_chain()?

I find it simpler that way. Now the first and second pass loops have
essentially the same structure.
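
Roughly like this (sketch only -- heap_prune_disconnected is a made-up
name for whatever ends up handling the second pass):

/* First pass: process each unvisited item as a potential chain start */
for (offnum = FirstOffsetNumber;
     offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    if (prstate.visited[offnum])
        continue;
    ndeleted += heap_prune_chain(buffer, offnum, &prstate);
}

/*
 * Second pass: anything still unvisited must be a disconnected
 * heap-only tuple
 */
for (offnum = FirstOffsetNumber;
     offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    if (prstate.visited[offnum])
        continue;
    ndeleted += heap_prune_disconnected(buffer, offnum, &prstate);
}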

> >
> > /*
> > - * XXX Should we update the FSM information of this page ?
> > - *

> I don't think this should be part of the commit?

Agreed. Pushed that as a separate commit just now.

> I think we should make it an error to reach the same tuple multiple ways. It's
> not *quite* as trivial as making this an assert/error though, as we can only
> raise an error after checking that the xmin/xmax thing passes.

But as you go on to say, we should *not* do that for LP_DEAD or
LP_UNUSED items (nor for heap-only tuples that don't pass the
xmin/xmax crosscheck) -- since in general a heap-only tuple's t_ctid
link is allowed to point to just about anything (e.g., past the end of
a heap page's line pointer array).
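
Just to illustrate the hazard: an offset taken from a heap-only
tuple's t_ctid has to be treated as untrusted input, along these lines
(sketch, not v3 verbatim):

/* Follow t_ctid only after range-checking it against this page */
nextoffnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
if (ItemPointerGetBlockNumber(&htup->t_ctid) != BufferGetBlockNumber(buffer) ||
    nextoffnum < FirstOffsetNumber ||
    nextoffnum > PageGetMaxOffsetNumber(page))
    break;                      /* link points off-page or out of range */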

Wouldn't it be simpler and more robust if we leaned on
heap_prune_satisfies_vacuum() for this instead? There will end up
being some remaining "disconnected" heap-only tuples when the same
tuple is reachable multiple ways. It's very likely that these
apparently-disconnected-but-who-knows tuples will fail some kind of
validation by heap_prune_satisfies_vacuum() in the second pass over
the page -- we expect to be able to treat them as DEAD, which provides
a good cross-check. (Granted, there is an unresolved question about
exactly which HTSV return codes we can expect here, but I imagine that
a LIVE disconnected tuple will be impossible, for example.)
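
Concretely, I'm picturing the second-pass handler looking something
like this, in the pruneheap.c context (sketch only; the
heap_prune_disconnected name is still made up, and the final Assert
encodes my guess about HTSV return codes -- which is exactly the open
question):

static int
heap_prune_disconnected(Buffer buffer, OffsetNumber offnum, PruneState *prstate)
{
    Page            page = BufferGetPage(buffer);
    ItemId          lp = PageGetItemId(page, offnum);
    HeapTupleHeader htup;
    HeapTupleData   tup;
    HTSV_Result     res;

    Assert(ItemIdIsNormal(lp)); /* only heap-only tuples remain by now */
    htup = (HeapTupleHeader) PageGetItem(page, lp);

    tup.t_data = htup;
    tup.t_len = ItemIdGetLength(lp);
    tup.t_tableOid = RelationGetRelid(prstate->rel);
    ItemPointerSet(&tup.t_self, BufferGetBlockNumber(buffer), offnum);

    res = heap_prune_satisfies_vacuum(prstate, &tup, buffer);
    if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
    {
        heap_prune_record_unused(prstate, offnum);
        HeapTupleHeaderAdvanceLatestRemovedXid(htup,
                                               &prstate->latestRemovedXid);
        return 1;
    }

    Assert(false);              /* e.g. LIVE should be impossible here */
    return 0;
}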

Note also that v3 directly Asserts against the truly important case:
the case where a HOT chain whose root item is an LP_REDIRECT has only
one valid-looking item (the root item itself). That assertion is near
the end of heap_prune_chain() in v3. It more or less replaces the very
dangerous "We found a redirect item that doesn't point to a valid
follow-on" block from HEAD.
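
In other words (paraphrasing the v3 assertion, with variable names
borrowed from the earlier hunk):

/*
 * At the end of heap_prune_chain(): a redirect root item must have
 * led us to at least one valid chain member beyond itself
 */
Assert(!rootisredirect || nchain > 1);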

> > /*
> > * If we are looking at the redirected root line pointer, jump to the
> > * first normal tuple in the chain. If we find a redirect somewhere
> > @@ -619,42 +617,63 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> > {
> > if (nchain > 0)
> > break; /* not at start of chain */
> > + Assert(rootisredirect);
> > chainitems[nchain++] = offnum;
> > offnum = ItemIdGetRedirect(rootlp);
> > continue;
> > }
>
> Why are we handling this inside the loop?

Removed in v3.

> > + if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
> > + {
> > + /*
> > + * XXX What if we just came from root item, which is a plain heap
> > + * tuple? Do we need assert that notices when we reach an LP_DEAD
> > + * or LP_UNUSED item having started from such a root item?
> > + */
> > + Assert(!rootisredirect || nchain > 1);
>
> I don't think we can assert that.

Agreed, it has been removed in v3.

> I don't understand these new asserts? We just set tupdead to false above?

I don't either. Removed in v3.

> I now actually wonder why this is correct. There's another comment about it:
>
> > We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> > * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
>
> But that doesn't justify very much.
>
> What prevents the scenario that some other backend e.g. has a snapshot with
> xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> row, but *not* the RECENTLY_DEAD one.

I'll need to think about this very carefully. I didn't think it was
worth blocking v3 on, though naturally it's a big concern.

> This comment can't really be understood without the historical behaviour of
> the function.

Agreed. Removed in v3.

> > + if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
> > + {
> > + heap_prune_record_unused(prstate, offnum);
> > + HeapTupleHeaderAdvanceLatestRemovedXid(htup,
> > + &prstate->latestRemovedXid);
> > + }
> > + else
> > + Assert(false);

> It seems weird to have the if (res == HEAPTUPLE_DEAD ..) branch, but then to
> unconditionally return 1.

I changed that in v3.

> I'm not actually sure the Assert is unreachable. I can imagine cases where
> we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent subtransaction
> abort or such.

I don't know either. I am leaving this question unresolved for now.

--
Peter Geoghegan

Attachment Content-Type Size
v3-0001-Fix-aborted-HOT-update-bug-in-heap-pruning.patch application/octet-stream 20.0 KB
