Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-10 16:03:06
Message-ID: CAH2-Wz=nJ93NtbqT966_gmpVRekd4=Y9Lngv1fMNzfeVwdM+5g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 10, 2021 at 8:49 AM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> Could you elaborate on what this "matches what we expect" entails?
>
> Apart from this, I'm also quite certain that the goto-branch that
> created this infinite loop should have been dead code: In a correctly
> working system, the GlobalVis*Rels should always be at least as strict
> as the vacrel->OldestXmin, but at the same time only GlobalVis*Rels
> can be updated (i.e. move their horizon forward) during the vacuum. As
> such, heap_prune_satisfies_vacuum should never fail to vacuum a tuple
> that also satisfies the condition of HeapTupleSatisfiesVacuum.

It's true that these two similar functions should be in perfect
agreement in general (given the same OldestXmin). That in itself
doesn't mean that they must always agree about a tuple in practice,
when they're called in turn inside lazy_scan_prune(). In particular,
nothing stops a transaction that was still in progress according to
heap_prune_satisfies_vacuum() (when it saw some tuples the transaction
inserted) from concurrently aborting. That will render the same tuples
fully DEAD inside HeapTupleSatisfiesVacuum(). So we need to restart
using the goto purely to cover that case. See the commit message of
commit 8523492d4e3.
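
To make the race concrete, here is a minimal toy model in C. It is only
a sketch and not the actual lazy_scan_prune() code: every type and
function name in it (ToyTuple, toy_verdict, toy_prune, toy_scan_prune)
is hypothetical, and the abort_after_prune flag merely simulates the
inserting transaction aborting between the pruning pass and the second
visibility check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not PostgreSQL's. */
typedef enum { XACT_IN_PROGRESS, XACT_ABORTED, XACT_COMMITTED } XactStatus;
typedef enum { TUPLE_LIVE, TUPLE_INSERT_IN_PROGRESS, TUPLE_DEAD } TupleVerdict;

typedef struct
{
    XactStatus inserter;    /* state of the inserting transaction */
    bool       pruned;      /* has pruning already removed the tuple? */
} ToyTuple;

/* Both checks derive their verdict from the inserter's state at call time. */
static TupleVerdict
toy_verdict(const ToyTuple *t)
{
    switch (t->inserter)
    {
        case XACT_IN_PROGRESS:
            return TUPLE_INSERT_IN_PROGRESS;
        case XACT_ABORTED:
            return TUPLE_DEAD;
        default:
            return TUPLE_LIVE;
    }
}

/* Stand-in for the pruning pass: removes the tuple only if it is DEAD now. */
static void
toy_prune(ToyTuple *t)
{
    if (toy_verdict(t) == TUPLE_DEAD)
        t->pruned = true;
}

/*
 * Stand-in for the retry logic.  abort_after_prune simulates the inserter
 * aborting after pruning looked at the tuple but before the second check.
 * Returns the number of retries that were needed.
 */
static int
toy_scan_prune(ToyTuple *t, bool abort_after_prune)
{
    int retries = 0;

    assert(t != NULL);

retry:
    toy_prune(t);

    if (abort_after_prune)
    {
        t->inserter = XACT_ABORTED;     /* the concurrent abort */
        abort_after_prune = false;      /* it can only happen once */
    }

    /* Second check: a DEAD tuple that pruning left behind forces a retry. */
    if (!t->pruned && toy_verdict(t) == TUPLE_DEAD)
    {
        retries++;
        goto retry;
    }

    return retries;
}
```

Calling toy_scan_prune() on a tuple whose inserter is XACT_IN_PROGRESS
with abort_after_prune set takes the goto exactly once, and the second
pruning pass removes the tuple. Without the simulated abort the two
checks agree and no retry happens.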

By "matches what we expect", I meant "involves a just-aborted
transaction". We could defensively verify that the inserting
transaction concurrently aborted at the point of retrying/calling
heap_page_prune() a second time. If there is no aborted transaction
involved (as was the case with this bug), then we can be confident
that something is seriously broken.
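
A rough sketch of what such a defensive check could look like, again as
a self-contained toy rather than a patch: ToyCase and
toy_scan_with_check are hypothetical names, and the two booleans simply
stand for what each of the two visibility checks concludes about the
tuple. Real code would more likely raise an error or trip an assertion
than return a status code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not PostgreSQL's. */
typedef enum { XACT_IN_PROGRESS, XACT_ABORTED, XACT_COMMITTED } XactStatus;

typedef struct
{
    XactStatus inserter;            /* inserter's status at retry time */
    bool       prune_sees_dead;     /* verdict of the pruning-side check */
    bool       vacuum_sees_dead;    /* verdict of the second check */
} ToyCase;

/*
 * Returns 0 once the page is processed, or -1 when a retry is requested
 * although the inserting transaction did not abort.  The -1 path is the
 * "something is seriously broken" case: it is reported instead of
 * retrying forever.
 */
static int
toy_scan_with_check(const ToyCase *input)
{
    ToyCase c;

    assert(input != NULL);
    c = *input;                     /* work on a private copy */

    for (;;)
    {
        bool pruned = c.prune_sees_dead;

        /* The checks agree, or pruning already removed the tuple. */
        if (pruned || !c.vacuum_sees_dead)
            return 0;

        /* The checks disagree: only a just-aborted inserter explains it. */
        if (c.inserter != XACT_ABORTED)
            return -1;

        /* Expected case: on retry, pruning sees the abort as well. */
        c.prune_sees_dead = true;
    }
}
```

With an aborted inserter the disagreement resolves after one retry and
the function returns 0. With any other inserter status the same
disagreement returns -1 immediately instead of spinning.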

--
Peter Geoghegan
