Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Date: 2024-07-15 22:01:43
Message-ID: CAH2-WzmXqaX37VCVjTDBF5cauzUgXOkuXsSTLOq56EAB1vzkeQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> Attached v3 has one important additional component in the test -- I
> use pg_stat_progress_vacuum to confirm that we actually do more than
> one pass of index vacuuming. Otherwise, it would have been trivial for
> the test to incorrectly pass.

That's a good idea.

> I could still use another pair of eyes on the test (looking out for
> stability enhancing measures I could take).

First, the basics: I found that your test failed reliably without your
fix, and passed reliably with your fix.

Next, performance: the total test runtime (as indicated by "time meson
test -q recovery/043_vacuum_horizon_floor") was comparable to other
recovery/* TAP tests. The new vacuum_horizon_floor test is a little on
the long running side, as these tests go, but I think that's fine.

Minor nitpicking about the comments in your TAP test:

* It is necessary but not sufficient for your test to "skewer"
maybe_needed, relative to OldestXmin. Obviously, it is not sufficient
because the test can only fail when VACUUM prunes a heap page after
the backend's horizons have been "skewered" in this sense.

Pruning is when we get stuck, and if there's no more pruning then
there's no opportunity for VACUUM to get stuck. Perhaps this point
should be noted directly in the comments. You could add a sentence
immediately after the existing sentence "Then vacuum's first pass will
continue and pruning...". This new sentence would then add commentary
such as "Finally, vacuum's second pass over the heap...".

* Perhaps you should point out that you're using VACUUM FREEZE for
this because it'll force the backend to always get a cleanup lock.
This is something you rely on to make the repro reliable, but that's
it.

In other words, point out to the reader that this bug has nothing to
do with freezing; it just so happens to be convenient to use VACUUM
FREEZE this way, due to implementation details.

* The sentence "VACUUM proceeds with pruning and does a visibility
check on each tuple..." describes the bug in terms of the current
state of things on Postgres 17, but Postgres 17 hasn't been released
just yet. Does that really make sense?

If you want to describe the invariant that caused
heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the
commit message of your fix seems like the right place for that. You
could reference these errors in passing. The errors seem fairly
incidental to the real problem, at least to me.

I think that there is some chance that this test will break the build
farm in whatever way, since there is a long history of VACUUM not
quite behaving as expected with these sorts of tests. I think that you
should commit the test case separately, first thing in the morning,
and then keep an eye on the build farm for the rest of the day. I
don't think that it's sensible to bend over backwards, just to avoid
breaking the build farm in this way.

Thanks for working on this
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2024-07-15 22:48:37 Re: CI, macports, darwin version problems
Previous Message Andres Freund 2024-07-15 21:58:16 Re: Upgrade Debian CI images to Bookworm