Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, "Wood, Dan" <hexpert(at)amazon(dot)com>, "Wong, Yi Wen" <yiwong(at)amazon(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))
Date: 2017-10-16 19:57:39
Message-ID: CAH2-Wz=4C2_m=EKZxuJRwh_hTVgLzaaussNNxeh_Oi_QxS9Spw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 13, 2017 at 7:09 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> All good questions; I don't know offhand. Discovering those answers is
> perhaps the chief labor required of such a project.

ISTM that by far the hardest part of the project is arriving at a
consensus around what a good set of invariants for CLOG and MultiXact
looks like.

I think that it's fair to say that this business with relfrozenxid now
appears to be more complicated than many of us would have thought. Or,
at least, more complicated than I thought when I first started
thinking about it. Once we're measuring this complexity (by having
checks), we should be in a better position to keep it under control,
and to avoid ambiguity.

> The checker should
> consider circumstances potentially carried from past versions via pg_upgrade.

Right. False positives are simply unacceptable.

> Fortunately, if you get some details wrong, it's cheap to recover from checker
> bugs.

Ideally, amcheck will become a formal statement of the contracts
provided by major subsystems, such as the heapam, the various SLRUs,
and so on. While I agree that having bugs there is much less severe
than having bugs in backend code, I would like the tool to reach a
point where it actually *defines* correctness (by community
consensus). If a bug in amcheck reflects a bug in our high level
thinking about correctness, then that actually is a serious problem.
Arguably, it's the most costly variety of bug that Postgres can have.

I may never be able to get general buy-in here; building broad
consensus like that is a lot harder than writing some code for a
contrib module. Making the checking code the *authoritative* record of
how invariants are *expected* to work is a major goal of the project,
though.

--
Peter Geoghegan
