Re: Eager page freeze criteria clarification

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: Eager page freeze criteria clarification
Date: 2023-09-29 18:27:33
Message-ID: CA+Tgmoa=0XhJ=Eo61m8vor4wUki2hhJoCT-syukidHEvsa+DmQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 29, 2023 at 11:57 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Assuming your concern is more or less limited to those cases where the
> same page could be frozen an unbounded number of times (once or almost
> once per VACUUM), then I think we fully agree. We ought to converge on
> the right behavior over time, but it's particularly important that we
> never converge on the wrong behavior instead.

I think that more or less matches my current thinking on the subject.
A caveat might be: If it were once per two vacuums rather than once
per vacuum, that might still be an issue. But I agree with the idea
that the case that matters is *repeated* wasteful freezing. I don't
think freezing is expensive enough that individual instances of
mistaken freezing are worth getting too stressed about, but as you
say, the overall pattern does matter.

> The TPC-C scenario is partly interesting because it isn't actually
> obvious what the most desirable behavior is, even assuming that you
> had perfect information, and were not subject to practical
> considerations about the complexity of your algorithm. There doesn't
> seem to be perfect clarity on what the goal should actually be in such
> scenarios -- it's not like the problem is just that we can't agree on
> the best way to accomplish those goals with this specific workload.
>
> If performance/efficiency and performance stability are directly in
> tension (as they sometimes are), how much do you want to prioritize
> one or the other? It's not an easy question to answer. It's a value
> judgement as much as anything else.

I think that's true. For me, the issue is what a user is practically
likely to notice and care about. I submit that on a
not-particularly-busy system, it would probably be fine to freeze
aggressively in almost every situation, because you're only incurring
costs you can afford to pay. On a busy system, it's more important to
be right, or at least not too badly wrong. But even on a busy system,
I think that when the time between data being written and being frozen
is more than a few tens of minutes, it's very doubtful that anyone is
going to notice the contribution that freezing makes to the overall
workload. They're much more likely to notice an annoying autovacuum
than they are to notice a bit of excess freezing that ends up getting
reversed. But when you start cranking the time between writing data
and freezing it down to a single-digit number of minutes, and even
more so if you push it down to tens of seconds or less, now I think
people are going to care more about useless freezing work than about
long-term autovacuum risks. Because now their database is really busy
so they care a lot about performance, and seemingly most of the data
involved is ephemeral anyway.

> Even if you're willing to assume that vacuum_freeze_min_age isn't just
> an arbitrary threshold, this still seems wrong. vacuum_freeze_min_age
> is applied by VACUUM, at the point that it scans pages. If VACUUM were
> infinitely fast, and new VACUUMs were launched constantly, then
> vacuum_freeze_min_age (and this bucketing scheme) might make more
> sense. But, you know, they're not. So whether or not VACUUM (with
> Andres' algorithm) deems a page that it has frozen to have been
> opportunistically frozen or not is greatly influenced by factors that
> couldn't possibly be relevant.

I'm not totally sure that I understand what you're concerned about
here, but I *think* the issue you're worried about is this: if we
have various rules that can cause freezing, let's say X, Y, and Z,
and we adjust the aggressiveness of rule X based on the performance
of rule Y, that would be stupid and might suck.

Assuming that the previous sentence is a correct framing, let's take X
to be "freezing based on the page LSN age" and Y to be "freezing based
on vacuum_freeze_min_age". I think the problem scenario here would be
if it turns out that, under some set of circumstances, Y freezes more
aggressively than X. For example, suppose the user runs VACUUM FREEZE,
effectively setting vacuum_freeze_min_age=0 for that operation. If the
table is being modified at all, it's likely to suffer a bunch of
unfreezing right afterward, which could cause us to decide to make
future vacuums freeze less aggressively. That's not necessarily what
we want, because evidently the user, at least at that moment in time,
thought that previous freezing hadn't been aggressive enough. They
might be surprised to find that flash-freezing the table inhibited
future automatic freezing.
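
To make sure we're worried about the same failure mode, here's a
caricature of the kind of feedback loop I have in mind, in made-up
code (none of these names exist anywhere; the real patch presumably
does its bookkeeping quite differently):

/*
 * After a vacuum, look at how many of the pages the previous vacuum
 * froze have since been dirtied again, and loosen or tighten the
 * eager freezing cutoff accordingly. All names here are invented.
 */
static uint64
adjust_eager_freeze_cutoff(uint64 cutoff,
                           uint64 pages_frozen_last_time,
                           uint64 frozen_pages_since_modified)
{
    double      waste_ratio;

    if (pages_frozen_last_time == 0)
        return cutoff;

    waste_ratio = (double) frozen_pages_since_modified /
        pages_frozen_last_time;

    /*
     * The trap: pages frozen only because the user ran VACUUM FREEZE
     * (vacuum_freeze_min_age = 0) are counted just like pages frozen
     * by the LSN-age rule, so one manual FREEZE that gets "undone" by
     * ordinary churn makes later vacuums less eager.
     */
    if (waste_ratio > 0.5)
        cutoff *= 2;            /* back off: freeze less eagerly */
    else if (waste_ratio < 0.05)
        cutoff /= 2;            /* freeze more eagerly */

    return cutoff;
}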

Or suppose that they just have a very high XID consumption rate
compared to the rate of modifications to this particular table, such
that criteria related to vacuum_freeze_min_age tend to be satisfied a
lot, and thus vacuums tend to freeze a lot no matter what the page LSN
age is. This scenario actually doesn't seem like a problem, though. In
this case the freezing criterion based on page LSN age is already not
getting used, so it doesn't really matter whether we tune it up or
down or whatever.
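
(To put rough numbers on it: at 10,000 XIDs/second system-wide, the
default vacuum_freeze_min_age of 50 million is crossed in under an
hour and a half, so by the time a vacuum reaches a rarely-modified
page, its tuples will very likely be past the age cutoff no matter
what the page LSN says.)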

The earlier scenario, where the user ran VACUUM FREEZE, is weirder,
but it doesn't sound that horrible, either. I did stop to wonder if we
should just remove vacuum_freeze_min_age entirely, but I don't really
see how to make that work. If we just always froze everything, then I
guess we wouldn't need that value, because we would have effectively
hard-coded it to zero. But if not, we need some kind of backstop to
make sure that XID age eventually triggers freezing even if nothing
else does, and vacuum_freeze_min_age is that thing.
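
Roughly speaking, I'm assuming we keep something shaped like this as
the backstop, however the eager heuristic ends up looking (this
hand-waves over the real cutoff bookkeeping; FreezeLimit is derived
from OldestXmin minus vacuum_freeze_min_age):

static bool
xid_age_forces_freeze(TransactionId xmin, TransactionId freeze_limit)
{
    /*
     * Freeze any tuple whose xmin has fallen behind the limit, no
     * matter what the LSN-age heuristic would have decided.
     */
    return TransactionIdIsNormal(xmin) &&
        TransactionIdPrecedes(xmin, freeze_limit);
}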

So I agree there could maybe be some kind of problem in this area, but
I'm not quite seeing it.

> Okay then. I guess it's more accurate to say that we'll have a strong
> bias in the direction of freezing when an FPI won't result, though not
> an infinitely strong bias. We'll at least have something that can be
> thought of as an improved version of the FPI thing for 17, I think --
> which is definitely significant progress.

I do kind of wonder whether we're going to care about the FPI thing in
the end. I don't mind if we do. But I wonder if it will prove
necessary, or even desirable. Andres's algorithm requires a greater
LSN age to trigger freezing when an FPI is required than when one
isn't. But Melanie's test results seem to me to show that using a
small LSN distance freezes too much on pgbench_accounts-type workloads
and using a large one freezes too little on insert-only workloads. So
I'm currently feeling a lot of skepticism about how useful it is to
vary the LSN-distance threshold as a way of controlling the behavior.
Maybe that intuition is misplaced, or maybe it will turn out that we
can use the FPI criterion in some more satisfying way than using it to
frob the LSN distance. But if the algorithm does an overall good job
guessing whether pages are likely to be modified again soon, then why
care about whether an FPI is required? And if it doesn't, is caring
about FPIs good enough to save us?
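
For reference, the shape I'm reacting to is basically this, with
made-up cutoff names and hand-waving over how the thresholds
themselves get chosen:

static bool
should_opportunistically_freeze(Page page, bool freeze_requires_fpi,
                                uint64 cutoff_no_fpi, uint64 cutoff_fpi)
{
    uint64      page_lsn_age = GetXLogInsertRecPtr() - PageGetLSN(page);

    /*
     * Require the page to have been "quiet" for a larger LSN distance
     * before freezing it when doing so would emit an FPI.
     */
    return page_lsn_age >= (freeze_requires_fpi ? cutoff_fpi : cutoff_no_fpi);
}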

--
Robert Haas
EDB: http://www.enterprisedb.com
