Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date: 2024-03-29 18:27:33
Message-ID: CA+TgmoYSM234TDJCyjAHch9igHP2tahXXENc8hBT+BHwcMkT8w@mail.gmail.com
Lists: pgsql-bugs

On Fri, Mar 29, 2024 at 1:17 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> FWIW I never thought that the order that we called
> vacuum_get_cutoffs() relative to when we call GlobalVisTestFor() was
> directly significant (though I did think that about the order that we
> attain VACUUM's rel_pages and the vacuum_get_cutoffs() call). I can't
> have thought that, because clearly GlobalVisTestFor() just returns a
> pointer, and so cannot directly affect backend local state.

Hmm, OK.

> It was clear that this is an important issue, from an early stage.
> Pre-release 14 had 2 or 3 bugs that all had the same symptom:
> lazy_scan_prune would loop forever. This was true even though each of
> the bugs had fairly different underlying causes (all tied to
> dc7420c2c). I figured that there might well be more bugs like that in
> the future.

Looks like you were right.

> I have every reason to believe that the remaining problems in this
> area are extremely rare. I wonder if it would make sense to focus on
> making the infinite loop behavior in lazy_scan_prune just throw an
> error.
>
> I now fear that that'll be harder than one might think. At the time
> that I added the looping behavior (in commit 8523492d), I believed
> that the only "legitimate" reason that it could ever be needed was the
> same reason why we needed the old tupgone behavior (to deal with
> concurrently-inserted tuples from transactions that abort in flight).
> But now I worry that it's actually protective, in some way that isn't
> generally understood. And so it might be that converting the retry
> into a hard error (e.g., erroring-out after MaxHeapTuplesPerPage
> retries) will create new problems.

It also sounds like it would boil down to "ERROR: our code sucks", so
count me as not a fan of that approach. As much as I don't like the
idea of significant changes to the back-branches, I think I like that
idea even less.

On the other hand, I also don't have an idea that I do like right now,
so it's probably too early to decide anything just yet. I'll try to
find more time to study this (and I hope others do the same).

--
Robert Haas
EDB: http://www.enterprisedb.com
