Re: Disabling vacuum truncate for autovacuum

From: Gurjeet Singh <gurjeet(at)singh(dot)im>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Will Storey <will(at)summercat(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Disabling vacuum truncate for autovacuum
Date: 2025-03-16 00:14:14
Message-ID: CABwTF4WFp-oNSm6utik6g-1HZBxijTJBcuX7BKD8MYqza+KMnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers

+Andres Freund, since an old email of his is quoted here.

On Fri, Mar 14, 2025 at 8:45 AM Nathan Bossart <nathandbossart(at)gmail(dot)com> wrote:
>
> On Thu, Mar 06, 2025 at 08:54:59AM +0900, Fujii Masao wrote:
> > +1 to having the reloption (if specified) override the GUC setting.
> > That is, I think that autovacuum_vacuum_truncate as defining
> > the default behavior for VACUUM truncation, and that the GUC should
> > only apply when neither the TRUNCATE option in VACUUM nor
> > the reloption is set.
>
> One other difference in my version of the patch [0] is to call this GUC
> vacuum_truncate and have it apply to both autovacuum and VACUUM. I did
> this for the following reasons:

+1 for the GUC name for the reasons you identified. But -1 for the behaviour
where the reloption and vacuum command's options overrides GUC.

I did not consider the VACUUM (TRUNCATE) command, and was focused on autovacuum,
since that's what caused the replication stall and the consequent application
outage in the anecdote I was relaying.

I'd like to bring our attention back to how this thread started. Will started
the discussion by asking for a way to disable autovacuum's truncate behaviour.
Even though the production outage he faced was due to manual vacuum, he was
worried about the same behaviour that autovacuum may cause, especially since the
parameter old_snapshot_threshold is no longer available; old_snapshot_threshold
allowed sysadmins like Will to disable the truncation behaviour globally. I
provided an anecdote where autovacuum's truncation behaviour had in fact caused
a replication outage as well as the consequent application outage.

The behaviour that is being proposed here does not prevent that situation from
arising again. A sysadmin who's trying to prevent replication outage and
a consequent application outage won't benefit from tuning vacuum_truncate GUC,
because a reloption or VACUUM (TRUNCATE) command will override its behaviour,
and lead to an outage.

We want this new GUC to give the sysadmin the power to override the per-relation
and per-command settings.

> IIUC reloptions with corresponding GUCs typically override the GUC setting,
> although autovacuum_enabled is arguably an exception. If setting the GUC
> to false overrides the relation-specific settings, it also becomes more
> difficult to enable truncation for just a couple of tables, although that
> might not be a popular use-case. Furthermore, even if we do want the GUC
> to override the reloption, it won't override VACUUM (TRUNCATE).

I guess what I'm looking for is a global switch that guarantees no relation
truncation will take place on the instance, so that the relevant replication
record is never emitted, and hence will never lead to a blocked replication on
the replica and never cause a consequent outage of applications connected to the
replica(s). That is, as a sysadmin, I need a global variable that overrides and
disables any relation-level and command-level truncation operations. I guess
that's why naming the GUC *disable*_vacuum_truncate made sense to me when I
initially proposed the autovacuum patch, since it was preventing autovacuum's
default truncation behaviour.

The downside of disabling _all_ truncation operations is that database size can
only grow and never shrink, and it may be frustrating for a regular user who's
trying to shrink their table sizes. Even in those situations, a sysadmin may
prefer that none of the tables ever be shrunk in normal operation, rather than
running the risk of causing a replication outage and a consequent application
outage.

In Will's words:
> I expect I would need to monitor
> tables for problematic growth, but that might be better than a surprise
> outage. I suppose the growth could cause an outage too, but I'm thinking it
> would be more controllable.

The sysadmin can schedule a periodic maintenance window where the GUC is changed
for the duration of the maintenance, allowing truncation operations globally,
and then running VACUUM (TRUNCATE), or relying on autovacuum to truncate
relations within that maintenance period.

With the proposed v2 patch behaviour, a regular user is still in control of
truncation behaviour, and hence can cause a replication outage and application
outage on the replicas. Essentially, this patch fails to help the sysadmin, and
leaves them at the mercy of table owners and database owners.

Perhaps the compromise is that that the sysadmin will run a regular script to
check that none of the relations have the reloption set to truncate the
relation, and also communicate to the application developers that they shouldn't
run the VACUUM (TRUNCATE) command. But anyone who has run a moderately large
fleet of Postgres instances would agree that both those things are not
foolproof, and will tend to be ignored or forgotten in the long run. Especially
the VACUUM (TRUNCATE) option is so enticing for the average user that it's near
impossible to prevent them from using it.

In the message linked by Will upthread, Andres says:
> The production issue is that
> autovacuums constantly cancel queries on the standbys despite
> hot_standby_feedback if you have a workload that does frequent
> truncations. If there's no way to configure it in a way that autovacuum
> takes it into account, people will just disable autovacuum and rely
> entirely on manual scripts. That's what already happens - leading to a
> much bigger foot-gun than disabling truncation. FWIW, people already in
> production use the workaround to configuring snapshot_too_old as that,
> for undocumented reasons, disables trunctations.

The snapshot_too_old error/feature is not available anymore, and add to that the
fact that manual scripts are not effective for transient tables; tables with
short-enough lifetime that they may be truncated before the next run of the
scripts.

I think this exercise is moot if we're not solving the problem of letting
sysadmin disable truncation altogether, and eliminating the possibility of an
outage. By moving forward with v2, we're not addressing the problem that Will
started this conversation with.

It's entirely possible that I might be misunderstanding your proposal, though,
in which case please feel free to illuminate how the proposed v2 patch will help
a sysadmin counter vacuum's and autovacuum's truncation behaviour, despite the
average user unknowingly monkeying with the database availability.

If the GUC is named vacuum_truncate, then as Fujii Masao reasoned, it can be
taken to mean the default value of per-relation and VACUUM options.

So I propose that GUC be named disable_vacuum_truncate (antonym of the name you
proposed), since it _prevents_ the default truncation behaviour of vacuum and
autovacuum. When false, vacuum and autovacuum perform truncation as they do
normally. When true, no truncations are performed, neither by vacuum nor by
autovacuum; they instead emit a NOTICE, notifying that relation truncation was
prevented by the GUC disable_vacuum_truncate. Default value should be false, to
match the current behaviour. The GUC context should be PGC_SIGHUP so that it can
be changed only by the sysadmin, and can be changed without requiring a restart.

Optional: disable_vacuum_truncate=true has no effect if wal_level==minimal,
since this whole debate and feature is to avoid replication outage, and any
downstream effects on replicas, hence irrelevant to systems where replication is
not in use.

> I should also mention that we just have a few weeks left in the v18
> development cycle. The code itself seems pretty straightforward, so if we
> can agree on behavior and nomenclature, I'll do my darndest to get this
> responsibly committed in time.

Thank you for all your help and work on this patch! Unfortunately I haven't been
able to work on it as much as I would've liked to.

Best regards,
Gurjeet
http://Gurje.et

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Laurenz Albe 2025-03-16 05:29:17 Re: Disabling vacuum truncate for autovacuum
Previous Message Tom Lane 2025-03-15 19:02:08 Re: After upgrading libpq, the same function(PQftype) call returns a different OID

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeremy Schneider 2025-03-16 01:23:37 Re: Update Unicode data to Unicode 16.0.0
Previous Message Alexander Borisov 2025-03-15 22:28:00 Re: Optimization for lower(), upper(), casefold() functions.