Re: What is "wraparound failure", really?

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: What is "wraparound failure", really?
Date: 2021-06-30 17:43:24
Message-ID: CAH2-Wz=seunV_jxRbA-eF1bvW5C2p=Ai+3O6juccOLZ7hvLKow@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 30, 2021 at 6:46 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> The problem is that the setting is measuring something that is a
> pretty poor proxy for the thing we actually care about. It's measuring
> the XID age at which we're going to start forcing vacuums on tables
> that don't otherwise need to be vacuumed, but the thing we care about
> is the XID age at which those vacuums are going to *finish*. Now maybe
> you think that's a minor difference, and if your tables are small, it
> is, but if they're really big, it's not. If you have only tables that
> are say 1GB in size and your system is otherwise well-configured, you
> could probably crank autovacuum_freeze_max_age up all the way to the
> max without a problem. But if you have 1TB tables, you are going to
> need a lot more headroom.
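(For anyone following along: the per-table XID age that Robert is
describing is directly observable from the catalogs. A rough sketch of
the kind of query involved, compared against the forced-vacuum
threshold:)

```sql
-- Per-table XID age versus autovacuum_freeze_max_age.
-- age(relfrozenxid) reports how many XIDs old the table's frozen
-- horizon is; anti-wraparound autovacuum starts when it crosses
-- the autovacuum_freeze_max_age threshold.
SELECT c.relname,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
```

The point stands either way: this tells you when the anti-wraparound
vacuum will *start*, not when it will finish, and on a 1TB table those
are very different times.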

I 100% agree with all of that. However, I can't help but notice that
your argument seems to work best as an argument against how freezing
works in general. The scheduling is way too complex because the thing
we're trying to model is itself enormously complex and nonlinear by
its very nature. It's true that we can do a better job by
continually updating our understanding of the state of the system
dynamically, during each VACUUM. But maybe we should get rid of
freezing instead. Is it really so hard to do that, in the grand scheme
of things?

We have tuple freezing because we need it to solve a problem with the
"physical database" (not the "logical database"). Namely the problem
of having 32-bit XIDs in tuple headers when 64-bit XIDs are
theoretically what we need. I'm not actually in favor of 64-bit XIDs
in tuple headers (or anything like it), but I am in favor of at least
solving the problem with a true "physical database" level solution.
The definition of freezing unnecessarily couples our handling of the
XID issue with garbage collection by VACUUM, which makes everything
much more fragile. A
frozen tuple must necessarily be visible to any possible MVCC
snapshot. That's really fragile, in many different ways. It's also
unnecessary.

Why should XID wraparound be a problem for the entire system? Why not
just make it a problem for any very old MVCC snapshots that are
*actually* about to be affected? Some kind of "snapshot too old"
approach seems quite possible. I think that we can do a lot better
than freezing within the confines of the current heapam design (or the
design prior to the introduction of freezing ~20 years ago). Once
aborted XIDs are removed eagerly, a strict "logical vs physical"
separation of concerns can be imposed.

I'm sorry to go on about this again and again, but it really does seem
related to what you're saying. The current freezing design is hard to
model because it's inherently fragile.

> I think what we really need here is some kind of deadline-based
> scheduler. As Peter says, the problem is that we might run out of
> XIDs. The system should be constantly thinking about that and taking
> appropriate emergency actions to make sure it doesn't happen. Right
> now it's really pretty chill about the possibility of looming
> disaster. Imagine that you hire a babysitter and tell them to get the
> kids out of the house if there's a fire. While you're out, a volcano
> erupts down the block. A giant cloud of ash forms and there's lava
> everywhere, even touching the house, which begins to smolder, but the
> babysitter just sits there and watches TV. As soon as the first flames
> appear, the babysitter stops watching TV, gets the kids, and tries to
> leave the premises. That's our autovacuum scheduler! It has no
> inclination or ability to see the future; it makes decisions entirely
> based on the present state of things. In a lot of cases that's OK, but
> sometimes it leads to a completely ridiculous outcome.

Yeah, it's still pretty absurd, even with the failsafe.

To extend your analogy, in the real world the babysitter can afford to
make very conservative assumptions about whether or not the house is
about to catch fire. In practice the chances of that happening on any
given day are certainly very low -- it'll probably never come close to
happening even once. And there is an inherent asymmetry, since of
course the cost of a false positive is that the Friends reunion
episode is unnecessarily cut short, which is totally inconsequential
compared to the cost of a false negative. If there wasn't such a big
asymmetry then what we'd probably do is not even think about what the
babysitter does -- we just wouldn't care at all.

Anyway, I'll try to come up with a way of rewording this section of
the docs that mostly preserves its existing structure, but makes it
possible to talk about the failsafe. The current structure of this
section of the docs is needlessly ambiguous, but I think it can be
fixed without changing too much. FWIW I have heard things that
suggest that some users believe that modern PostgreSQL can actually
allow "the past to look like the future" in some cases -- probably
because of the wording here. This area of the system certainly is
scary, but it's not quite that scary.
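(To spell out what "the past looks like the future" would mean
mechanically: XID comparisons are circular, mod 2^32. A minimal Python
sketch of that comparison, modeled on the signed-difference trick used
by TransactionIdPrecedes -- it ignores the permanent XIDs like
FrozenTransactionId that the real code special-cases first:)

```python
def xid_precedes(a: int, b: int) -> bool:
    """True if XID a logically precedes XID b under 32-bit modular
    arithmetic -- the same signed-difference comparison as
    TransactionIdPrecedes. (The real code handles permanent XIDs
    below FirstNormalTransactionId separately.)"""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000  # signed 32-bit difference is negative

# Nearby XIDs compare the way you'd expect:
print(xid_precedes(100, 200))   # True: 100 is older than 200

# But an unfrozen xmin that has fallen more than ~2^31 XIDs behind
# no longer compares as being in the past:
old_xmin, current = 500_000_000, 3_000_000_000
print(xid_precedes(old_xmin, current))  # False: the past looks like the future
```

That flipped comparison is the failure mode the docs gesture at; the
question is how loudly they should say that modern PostgreSQL refuses
to let things get that far.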

--
Peter Geoghegan
