Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Fixes for two separate bugs in nbtree VACUUM's page deletion
Date: 2020-04-28 16:17:11
Message-ID: CAH2-Wzk6A53e6EZaJ4nnan3OCmhWXLZqOPNqf5GehNxQtrz0Wg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
<masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
> I agree with both patches.

Thanks for the review.

> For the first fix it seems better to push down the logic to the page
> deletion code as your 0001 patch does so. The following change changes
> the page deletion code so that it emits LOG message indicating the
> index corruption when a deleted page is passed. But we used to ignore
> in this case.

> I agree with this change but since I'm not sure this change directly
> relates to this bug I wonder if it can be a separate patch.

I am not surprised that you are concerned about this. That was the
hardest decision I had to make when writing the patch.

The central idea in the 0001-* patch (as I say at the end of the
commit message) is that we have a strict rule about where we handle
maintaining the oldest btpo.xact for already-deleted pages, and where
we handle it for pages that become deleted. The rules is this:
btvacuumpage() is strictly responsible for the former category (as
well as totally live pages), while _bt_pagedel() is responsible for
the latter category (this includes pages that started out as half-dead
but become deleted).

In order for this to work, _bt_pagedel() must not ever see a deleted
page that it did not delete itself. If I didn't explicitly take a
strict position on this, then I suppose that I'd have to have similar
handling for maintaining the oldest btpo.xact for existing deleted
pages encountered within _bt_pagedel(). But that doesn't make any
sense, even with corruption, so I took a strict position. Even with
corruption, how could we encounter an *existing* deleted page in
_bt_pagedel() but not encounter the same page within some call to
btvacuumpage() made from btvacuumscan()? In other words, how can we
fail to maintain the oldest btpo.xact for deleted pages even if we
assume that the index has a corrupt page with sibling links that point
to a fully deleted page? It seems impossible, even in this extreme,
contrived scenario.

Then there is the separate question of the new log message about
corruption. It's not too hard to see why _bt_pagedel() cannot ever
access an existing deleted page unless there is corruption:

* _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply
never pass a deleted page -- it handles those directly.

* _bt_pagedel() can only access additional leaf pages to consider
deleting them by following the right sibling link of the
btvacuumpage() caller's leaf page (or possibly some page even further
to the right if it ends up deleting many leaf pages in one go).
Following right links and finding a deleted page can only happen when
there is a concurrent page deletion by VACUUM, since the sibling links
to the deleted page are changed by the second stage of deletion (i.e.
by the atomic actionat where the page is actually marked deleted,
within _bt_unlink_halfdead_page()).

* There cannot be a concurrent VACUUM, though, since _bt_pagedel()
runs in VACUUM. (Note that we already depend on this for correctness
in _bt_unlink_halfdead_page(), which has a comment that says "a
half-dead page cannot resurrect because there can be only one vacuum
process running at a time".)

The strict position seems justified. It really isn't that strict. I'm
not concerned about the new LOG message concerning corruption being
annoying or wrong

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul Carlucci 2020-04-28 16:24:23 Re: PostgreSQL CHARACTER VARYING vs CHARACTER VARYING (Length)
Previous Message James Coleman 2020-04-28 16:16:47 Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays