Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Date: 2022-01-19 19:53:59
Message-ID: CAH2-WzmT9N_42b=yuT-B03E=R48ECA6iEvR48305CSfwg=WUjw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 19, 2022 at 6:56 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I don't think I've said anything on this thread that is an attack on
> you. I am getting pretty frustrated with the tenor of the discussion,
> though. I feel like you're the one attacking me, and I don't like it.

"Attack" is a strong word (much stronger than "defend"), and I don't
think I'd use it to describe anything that has happened on this
thread. All I said was that you misrepresented my views when you
pounced on my use of the word "continuous". Which, honestly, I was
very surprised by.

> For all of that, I'm not even convinced that you're wrong. I just
> think you might be wrong. I don't really know.

I agree that I might be wrong, though of course I think that I'm
probably correct. I value your input as a critical voice -- that's
generally how we get really good designs.

> However, there are cases where waiting, and only
> waiting, gets the job done. If you're not willing to admit that those
> cases exist, or you think they don't matter, then we disagree.

They exist, of course. That's why I don't want to completely eliminate
the idea of waiting for a cleanup lock. Rather, I want to change the
design to recognize that waiting is an extreme measure that should be
delayed for as long as possible. There are many ways that the problem
could naturally resolve itself.

Waiting for a cleanup lock after only 50 million XIDs (the
vacuum_freeze_min_age default) is like performing brain surgery to
treat somebody with a headache (at least with the infrastructure from
the earlier patches in place). It's not impossible that "surgery"
could help, in theory (could be a tumor, better to catch these things
early!), but that fact alone can hardly justify such a drastic
measure. That doesn't mean that brain surgery isn't ever appropriate,
of course. It should be delayed until it starts to become obvious that
it's really necessary (but before it really is too late).

> If you
> admit that they exist and think they matter but believe that there's
> some reason why increasing FreezeLimit can't cause any damage, then
> either (a) you have a good reason for that belief which I have thus
> far been unable to understand or (b) you're more optimistic about the
> proposed change than can be entirely justified.

I don't deny that it's just about possible that the changes that I'm
thinking of could make the situation worse in some cases, but I think
that the overwhelming likelihood is that things will be improved
across the board.

Consider the age of the tables from BenchmarkSQL, with the patch series:

relname │ age │ mxid_age
──────────────────┼─────────────┼──────────
bmsql_district │ 657 │ 0
bmsql_warehouse │ 696 │ 0
bmsql_item │ 1,371,978 │ 0
bmsql_config │ 1,372,061 │ 0
bmsql_new_order │ 3,754,163 │ 0
bmsql_history │ 11,545,940 │ 0
bmsql_order_line │ 23,095,678 │ 0
bmsql_oorder │ 40,653,743 │ 0
bmsql_customer │ 51,371,610 │ 0
bmsql_stock │ 51,371,610 │ 0
(10 rows)
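
For reference, a query along these lines produces that kind of report
(this isn't from the BenchmarkSQL harness, just the standard catalog
approach using age() and mxid_age()):

```sql
-- Hypothetical query to reproduce the report above; not taken from
-- any script in this thread.
SELECT c.relname,
       age(c.relfrozenxid) AS age,
       mxid_age(c.relminmxid) AS mxid_age
FROM pg_class c
WHERE c.relname LIKE 'bmsql_%'
  AND c.relkind = 'r'
ORDER BY age(c.relfrozenxid);
```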

We see significant "natural variation" here, unlike HEAD, where the
age of all tables is exactly the same at all times, or close to it
(incidentally, this leads to the largest tables all being
anti-wraparound VACUUMed at the same time). There is a kind of natural
ebb and flow for each table over time, as relfrozenxid is advanced,
due in part to workload characteristics. Less than half of all XIDs
will ever modify the two largest tables, for example, and so
autovacuum should probably never be launched because of the age of
either table (barring some change in workload conditions, perhaps). As
I've said a few times now, XIDs are generally "the wrong unit", except
when needed as a backstop against wraparound failure.

The natural variation that I see contributes to my optimism. A
situation where we cannot get a cleanup lock may well resolve itself
for many reasons that are hard to nail down precisely but are
nevertheless very real.

The vacuum_freeze_min_age design (particularly within an aggressive
VACUUM) is needlessly rigid, probably just because the assumption
before now has always been that we can only advance relfrozenxid in an
aggressive VACUUM (it might happen in a non-aggressive VACUUM if we
get very lucky, which cannot be accounted for). Because it is rigid,
it is brittle. Because it is brittle, it will (on a long enough
timeline, for a susceptible workload) actually break.

> On the other hand if that user is going to close that
> cursor after 10 minutes and open a new one in the same place 10
> seconds later, the best thing to do is to keep FreezeLimit as low as
> possible, because the first time we wait for the pin to be released
> we're guaranteed to advance relfrozenxid within 10 minutes, whereas if
> we don't do that we may keep missing the brief windows in which no
> cursor is held for a very long time. But we have absolutely no way of
> knowing which of those things is going to happen on any particular
> system, or of estimating which one is more common in general.

I agree with all that, and I think that this particular scenario is
the crux of the issue.

The first time this happens (and we don't get a cleanup lock), then we
will at least be able to set relfrozenxid to the exact oldest unfrozen
XID. So that'll already have bought us some wallclock time -- often a
great deal (why should the oldest XID on such a page be particularly
old?). Furthermore, there will often be many more VACUUMs before we
need to do an aggressive VACUUM -- each of these VACUUM operations is
an opportunity to freeze the oldest tuple that holds up cleanup. Or
maybe this XID is in a dead tuple, and so somebody's opportunistic
pruning operation does the right thing for us. Never underestimate the
power of dumb luck, especially in a situation where there are many
individual "trials", and we only have to get lucky once.

If and when that doesn't work out, and we actually have to do an
anti-wraparound VACUUM, then something will have to give. Since
anti-wraparound VACUUMs are naturally confined to certain kinds of
tables/workloads with the patch series, we can now be pretty confident
that the problem really is with this one problematic heap page, with
the idle cursor. We could even verify this directly if we wanted to,
by noticing that the preexisting relfrozenxid is an exact match for
one XID on some can't-cleanup-lock page -- we could emit a WARNING
about the page/tuple if we wanted to. To return to my colorful analogy
from earlier, we now know that the patient almost certainly has a
brain tumor.

What new risk is implied by delaying the wait like this? Very little,
I believe. Let's say we derive FreezeLimit from
autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We
still ought to have the opportunity to wait for the cleanup lock for
rather a long time -- if the XID consumption rate is so high that that
isn't true, then we're doomed anyway. All told, there seems to be a
huge net reduction in risk with this design.

--
Peter Geoghegan
