Re: post-recovery amcheck expectations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: post-recovery amcheck expectations
Date: 2023-10-09 23:46:26
Message-ID: CAH2-Wz=e4WihDXD6-aKW7dgVyQ0dNcnW5BWJZBKvw7Mmrcv59Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 4, 2023 at 7:52 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> Suppose we start with this nbtree (subset of a diagram from verify_nbtree.c):
>
> * 1
> * / \
> * 2 <-> 3
>
> We're deleting 2, the leftmost leaf under a leftmost internal page. After the
> MARK_PAGE_HALFDEAD record, the first downlink from 1 will lead to 3, which
> still has a btpo_prev pointing to 2. bt_index_parent_check() complains here:

Thanks for working on this. Your analysis seems sound to me.

When I reviewed the patch that became commit d114cc53, I presented
Alexander with a test case that demonstrated false positive reports of
corruption involving interrupted page splits (not interrupted page
deletions). Obviously I didn't give sufficient thought to this case,
which is analogous.

Might make sense to test the fix for this issue using a similar
approach: by adding custom code that randomly throws errors at a point
that stresses the implementation. I'm referring to the point at which
VACUUM is between the first and second phase of page deletion: right
before (or directly after) _bt_unlink_halfdead_page() is called. That
isn't fundamentally different to the approach from your TAP test, but
seems like it might add some interesting variation.

> One can encounter this if recovery ends between a MARK_PAGE_HALFDEAD record
> and its corresponding UNLINK_PAGE record. See the attached test case. The
> index is actually fine in such a state, right?

Yes, it is fine.

FWIW, this feels like it might be related to the fact that (unlike
Lanin & Shasha), we don't make the key space move left; we make it
move right instead (just like page splits). In other words, page
deletion isn't the precise opposite of a page split, which is a bit
awkward.

Note, in particular, that _bt_mark_page_halfdead() doesn't do a
straight delete of the pivot tuple in the parent page that points to
the target page, as you might expect. It actually deletes the right
sibling of the target page's pivot, and then performs an in-place overwrite of
the downlink from the pivot tuple that originally pointed to the
target page. Perhaps this isn't worth going into now, but thought you
might appreciate the context.

Terminology note: we sometimes use "downlink" as a synonym of "pivot
tuple" or even "separator key", which is misleading.

> I lean toward fixing this by
> having amcheck scan left; if left links reach only half-dead or deleted pages,
> that's as good as the present child block being P_LEFTMOST.

Also my preference.

> There's a
> different error from bt_index_check(), and I've not yet studied how to fix
> that:
>
> ERROR: left link/right link pair in index "not_leftmost_pk" not in agreement
> DETAIL: Block=0 left block=0 left link from block=4294967295.

This looks like this might be a straightforward case of confusing
P_NONE for InvalidBlockNumber (or vice-versa). They're both used to
indicate "no such block" here.

> Alternatively, one could view this as a need for the user to VACUUM between
> recovery and amcheck. The documentation could direct users to "VACUUM
> (DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off)" if not done since
> last recovery. Does anyone prefer that or some other alternative?

I'd rather not go that route. That strikes me as defining the problem
out of existence.

> For some other amcheck expectations, the comments suggest reliance on the
> bt_index_parent_check() ShareLock. I haven't tried to make test cases for
> them, but perhaps recovery can trick them the same way. Examples:
>
> errmsg("downlink or sibling link points to deleted block in index \"%s\"",
> errmsg("block %u is not leftmost in index \"%s\"",
> errmsg("block %u is not true root in index \"%s\"",

These are all much older. They're certainly all from before the
relevant checks were first added (by commit d114cc53), and seem much
less likely to be buggy.

These older cases are all cases where we descend directly from an
internal page to one of its child pages. Whereas the problem you've
demonstrated involves traversal across levels *and* across siblings in
newer code. That's quite a bit more complicated, since it requires
that we worry about both phases of page deletion -- not just the
first. That in itself necessitates that we deal with various edge
cases. (The really prominent edge-case is the interrupted page
deletion case, which requires significant handling, but evidently
missed a subtlety with leftmost pages).

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-10-09 23:48:27 Re: CREATE DATABASE with filesystem cloning
Previous Message Bruce Momjian 2023-10-09 23:45:16 Re: Lowering the default wal_blocksize to 4K