Re: Eagerly scan all-visible pages to amortize aggressive vacuum

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Robert Treat <rob(at)xzilla(dot)net>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: Eagerly scan all-visible pages to amortize aggressive vacuum
Date: 2025-01-24 20:02:21
Message-ID: CAAKRu_a5iX=Y7Ysy08fsHhntsJPo6cGHfZ4_o-qXoGeoq1zK0g@mail.gmail.com
Lists: pgsql-hackers

Thanks for the review! Attached v10 addresses the review feedback
except that about the GUC/table option meaning and format.

On Thu, Jan 23, 2025 at 2:22 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Jan 23, 2025 at 1:31 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > On Wed, Jan 22, 2025 at 5:48 PM Melanie Plageman
> > <melanieplageman(at)gmail(dot)com> wrote:
> > > Thanks! Attached v9 incorporates all your suggested changes.
> >
> > I'm not exactly sure what to do about it, but I feel like the
> > documentation of vacuum_eager_scan_max_fails is going to be
> > incomprehensible to someone who doesn't already have a deep knowledge
> > of how this patch works.
>
> There's a lot of magic numbers here: 4096 blocks per region, 128 fails
> per region, some other number of successes per region. I don't know
> why this particular one is configurable and none of the others are.
> I'm not saying it's the wrong decision, but it's not clear to me what
> the reasoning is.

The reasoning for making the failure cap configurable and not the
success cap is that it made more sense to me to configure your failure
tolerance than your success tolerance. If you make the success cap
configurable, you can end up scanning and failing to freeze just as
many pages when it is 0 as when it is 100.

This thought exercise made me realize something is wrong with my
current patch, though. If you set the failure tolerance
(vacuum_eager_scan_max_fails) to 0 right now, it disables eager
scanning altogether. That might be unexpected: you would probably
expect a setting of 0 to still allow eager scanning as long as it
keeps succeeding. That made me think that perhaps I should have a -1
value that disables eager scanning altogether, with 0 just setting the
failure tolerance to 0.

I don't think the region size should be configurable because it is
unclear how you might tune that based on your workload to get
different results. We knew we wanted some kind of retry interval, so
we picked one that seemed reasonable.

> Also, the units of this parameter are pretty terrible. Like, it could
> be measured in some unit that involves bytes or kilobytes or
> megabytes, or, maybe better, it could be measured as a percentage.
> Instead of either of those things, it's an integer that only makes
> sense in reference to another (hard-coded, magical) integer. If you
> made this a percentage, then you wouldn't necessarily even have to
> tell people that the magic internal number is 4096. Or at least that
> knowledge would be a lot less critical.

I think you're right. I would go with a percentage. I don't see many
other GUCs that are percentages. What would you call it? Perhaps
vacuum_eager_scan_fail_threshold: the percentage of the blocks in the
table that vacuum may scan and fail to freeze.

There is an issue I see with making it a percentage. The current
default vacuum_eager_scan_max_fails is 128 out of 4096. That means you
are allowed to scan about 3% of the blocks in the table even if you
fail to freeze every one. I don't think there are very many reasonable
values above 256, personally. So, it might be weird to add a
percentage GUC value with only very low acceptable values. Part of
this is that we are not making the success cap configurable, so that
means that you might have lots of extra I/O if you are both failing
and succeeding. Someone configuring this GUC might think this is
controlling the amount of extra I/O they are incurring.

Note attached v10 does not change the GUC. I thought I would wait for
additional discussion.

> + <command>VACUUM</command> typically scans pages that have been
> + modified since the last vacuum. While some all-visible but not
> + all-frozen pages are eagerly scanned to try and freeze them, the
> + <structfield>relfrozenxid</structfield> can only be advanced when
> + every page of the table that might contain unfrozen XIDs is scanned.
>
> The phrase "While some all-visible but not all-frozen pages are
> eagerly scanned to try and freeze them" seems odd here, because what
> follows is true either way. This is like saying "When it's Tuesday, I
> get hungry if I don't eat any food."

The while was meant to contrast the sentence before it about typically
only scanning pages that have been modified since the previous vacuum.
But, I see how that didn't work out for me. How about this:

"""
<command>VACUUM</command> typically scans pages that have been
modified since the last vacuum. However, some all-visible but not
all-frozen pages are eagerly scanned to try to freeze them. Note that
the <structfield>relfrozenxid</structfield> can only be advanced when
every page of the table that might contain unfrozen XIDs is scanned.
"""

I know the last sentence seems a bit unrelated to the sentence before
it. But I am trying to contrast the three situations. Most of the
pages vacuum scanned are ones that have been modified since the
previous vacuum. Some of them might be unmodified and we are scanning
them because we want to freeze them. But, we also need to advance the
relfrozenxid, so occasionally a large number of the pages that we scan
might be unmodified.

> + /*
> + * We only want to enable eager scanning if we are likely to be able to
> + * freeze some of the pages in the relation. We can freeze tuples older
> + * than the visibility horizon calculated at the beginning of vacuum, but
> + * we are only guaranteed to freeze them if at least one tuple on the page
> + * precedes the freeze limit or multixact cutoff (calculated from
> + * vacuum_[multixact_]freeze_min_age). So, if the oldest unfrozen xid
> + * (relfrozenxid/relminmxid) does not precede the freeze cutoff, we aren't
> + * likely to freeze many tuples.
> + */
>
> I think this could be clearer. And I wonder if "We can freeze tuples
> older" is actually a typo for "We can freeze tuples newer", because
> otherwise I can't make sense of the rest of the sentence. Right now it
> seems to boil down to something like: We can freeze tuples older...but
> we are only guaranteed to freeze them if they are older.

I was trying to contrast opportunistic freezing with "guaranteed
freezing". I was trying to say that we can freeze tuples older than
FreezeLimit but we are only guaranteed to freeze tuples older than
OldestXmin.

> Maybe what we want to say here is something like: We only want to
> enable eager scanning if we are likely to be able to freeze some of
> the pages in the relation. If FreezeLimit has not advanced past
> relfrozenxid, or if MultiXactCutoff has not advanced past
> relminmxid, then there's a good chance we won't be able to freeze
> anything more than we already have. Success would not be impossible,
> because there may be pages we didn't try to freeze previously, and
> it's also possible that some XID greater than FreezeLimit has ended,
> allowing for freezing. But as a heuristic, we wait until the
> FreezeLimit advances, so that we won't repeatedly attempt eager
> freezing while the same long-running transaction blocks progress every
> time.

I rewrote the comment using some of your input and trying to clarify
my initial point about opportunistic vs guaranteed freezing.

"""
If FreezeLimit has not advanced past relfrozenxid, or
MultiXactCutoff has not advanced past relminmxid, we are unlikely to
be able to freeze anything new.
Tuples with XIDs older than OldestXmin or MXIDs older than OldestMxact
are technically freezable, but we won't freeze them unless some other
criterion for opportunistic freezing is met.
It is also possible that a transaction newer than the FreezeLimit has
ended, rendering additional tuples freezable.
As a heuristic, however, it makes sense to wait until the FreezeLimit
or MultiXactCutoff has advanced before eager scanning.
"""

> + ereport(INFO,
> + (errmsg("Vacuum successfully froze %u eager scanned blocks of
> \"%s.%s.%s\". Now disabling eager scanning.",
>
> I predict that if Tom sees this, you will get complaints about both
> the wording of the message, which pretty clearly does not conform to
> style guidelines for a primary error message, and also about the use
> of the INFO level. Allow me to suggest DEBUG1 or DEBUG2 and "disabling
> eager scanning after freezing %u eagerly scanned blocks".

I've used your wording. Just for future reference, are the style
guidelines I was violating the capitalization and punctuation? Are
these documented somewhere? Also, what is a primary error message?
INFO level? Or ones that use ereport and are translated? I looked at
other messages and saw that they don't capitalize the first word or
use punctuation, so I assume that those were problems.

> + /*
> + * Aggressive vacuums cannot skip all-visible pages that are not also
> + * all-frozen. Normal vacuums with eager scanning enabled only skip
> + * such pages if they have hit the failure limit for the current eager
> + * scan region.
> + */
> + if (vacrel->aggressive ||
> + vacrel->eager_scan_remaining_fails > 0)
> + {
> + if (!vacrel->aggressive)
> + next_unskippable_eager_scanned = true;
> + break;
> }
>
> I think this would be clearer as two separate if statements. if
> (aggressive) break; if (vacrel->eager_scan_remaining_fails) {
> next_unskippable_eager_scanned = true; break; }

Very true. Fixed.

- Melanie

Attachment Content-Type Size
v10-0001-Eagerly-scan-all-visible-pages-to-amortize-aggre.patch application/octet-stream 40.9 KB
