From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations |
Date: | 2022-03-24 19:28:08 |
Message-ID: | CAH2-WzmL98Ayv+5MsUF1xy_+brhconV43QgDi=PTBO5GWoioEg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Mar 24, 2022 at 10:21 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> You're probably not going to love hearing this, but I think you're
> still explaining things here in ways that are too baroque and hard to
> follow. I do think it's probably better.
There are a lot of dimensions to this work. It's hard to know which to
emphasize here.
> But, for example, in the
> commit message for 0001, I think you could change the subject line to
> "Allow non-aggressive vacuums to advance relfrozenxid" and it would be
> clearer.
But non-aggressive VACUUMs have always been able to do that.
How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"
> And then I think you could eliminate about half of the first
> paragraph, starting with "There is no fixed relationship", and all of
> the third paragraph (which starts with "Later work..."), and I think
> removing all that material would make it strictly more clear than it
> is currently. I don't think it's the place of a commit message to
> speculate too much on future directions or to wax eloquent on
> theoretical points. If that belongs anywhere, it's in a mailing list
> discussion.
Okay, I'll do that.
> It seems to me that 0002 mixes code movement with functional changes.
Believe it or not, I avoided functional changes in 0002 -- at least in
one important sense. That's why you had difficulty spotting any. This
must sound peculiar, since the commit message very clearly says that
the commit avoids a problem seen only in the non-aggressive case. It's
really quite subtle.
You wrote this comment and code block (which I propose to remove in
0002), so clearly you already understand the race condition that I'm
concerned with here:
- if (skipping_blocks && blkno < rel_pages - 1)
- {
- /*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
- */
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
- continue;
- }
What you're saying here boils down to this: it doesn't matter what the
visibility map would say right this microsecond (in the aggressive
case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
said that this page was all-frozen *in the recent past*. That's good
enough; we will never fail to scan a page that might have an XID <
OldestXmin (ditto for XMIDs) this way, which is all that really
matters.
This is absolutely mandatory in the aggressive case, because otherwise
relfrozenxid advancement might be seen as unsafe. My observation is:
Why should we accept the same race in the non-aggressive case? Why not
do essentially the same thing in every VACUUM?
In 0002 we now track if each range that we actually chose to skip had
any all-visible (not all-frozen) pages -- if that happens then
relfrozenxid advancement becomes unsafe. The existing code uses
"vacrel->aggressive" as a proxy for the same condition -- the existing
code reasons based on what the visibility map must have said about the
page in the recent past. Which makes sense, but only works in the
aggressive case. The approach taken in 0002 also makes the code
simpler, which is what enabled putting the VM skipping code into its
own helper function, but that was just a bonus.
And so you could almost say that there is now behavioral change at
all. We're skipping pages in the same way, based on the same
information (from the visibility map) as before. We're just being a
bit more careful than before about how that information is tracked, to
avoid this race. A race that we always avoided in the aggressive case
is now consistently avoided.
> I'm completely on board with moving the code that decides how much to
> skip into a function. That seems like a great idea, and probably
> overdue. But it is not easy for me to see what has changed
> functionally between the old and new code organization, and I bet it
> would be possible to split this into two patches, one of which creates
> a function, and the other of which fixes the problem, and I think that
> would be a useful service to future readers of the code.
It seems kinda tricky to split up 0002 like that. It's possible, but
I'm not sure if it's possible to split it in a way that highlights the
issue that I just described. Because we already avoided the race in
the aggressive case.
> I also think that the commit message for 0002 is probably longer and
> more complex than is really helpful, and that the subject line is too
> vague, but since I don't yet understand exactly what's happening here,
> I cannot comment on how I think it should be revised at this point,
> except to say that the second paragraph of that commit message looks
> like the most useful part.
I'll work on that.
> I would also like to mention a few things that I do like about 0002.
> One is that it seems to collapse two different pieces of logic for
> page skipping into one. That seems good. As mentioned, it's especially
> good because that logic is abstracted into a function. Also, it looks
> like it is making a pretty localized change to one (1) aspect of what
> VACUUM does -- and I definitely prefer patches that change only one
> thing at a time.
Totally embracing the idea that we don't necessarily need very recent
information from the visibility map (it just has to be after
OldestXmin was established) has a lot of advantages, architecturally.
It could in principle be hours out of date in the longest VACUUM
operations -- that should be fine. This is exactly the same principle
that makes it okay to stick with our original rel_pages, even when the
table has grown during a VACUUM operation (I documented this in commit
73f6ec3d3c recently).
We could build on the approach taken by 0002 to create a totally
comprehensive picture of the ranges we're skipping up-front, before we
actually scan any pages, even with very large tables. We could in
principle cache a very large number of skippable ranges up-front,
without ever going back to the visibility map again later (unless we
need to set a bit). It really doesn't matter if somebody else unsets a
page's VM bit concurrently, at all.
I see a lot of advantage to knowing our final scanned_pages almost
immediately. Things like prefetching, capping the size of the
dead_items array more intelligently (use final scanned_pages instead
of rel_pages in dead_items_max_items()), improvements to progress
reporting...not to mention more intelligent choices about whether we
should try to advance relfrozenxid a bit earlier during non-aggressive
VACUUMs.
> Hope that's helpful.
Very helpful -- thanks!
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2022-03-24 19:32:54 | pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint. |
Previous Message | Andrew Dunstan | 2022-03-24 19:06:21 | Re: Granting SET and ALTER SYSTE privileges for GUCs |