Re: Flush pgstats file during checkpoints

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Flush pgstats file during checkpoints
Date: 2024-07-05 04:52:31
Message-ID: Zod8D_w-ia9PSrCX@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
> I think those are two independent issues - knowing that the snapshot is
> from the last checkpoint, and knowing that it's correct (not corrupted).
> And yeah, we should be careful about fsync/durable_rename.

Yeah, that's bugging me as well. I don't really get why we would not
want durability at shutdown for this data. So how about switching the
end of pgstat_write_statsfile() to use durable_rename()? Sounds like
an independent change, worth on its own.

> Yeah, I was wondering about the same thing - can't this mean we fail to
> start autovacuum? Let's say we delete a significant fraction of a huge
> table, but then we crash before the next checkpoint. With this patch we
> restore the last stats snapshot, which can easily have
>
> n_dead_tup | 0
> n_mod_since_analyze | 0
>
> for the table. And thus we fail to notice the table needs autovacuum.
> AFAIK we run autovacuum on all tables with missing stats (or am I
> wrong?). That's what's happening on replicas after switchover/failover
> too, right?

That's the opposite, please look at relation_needs_vacanalyze(). If a
table does not have any stats found in pgstats, like on HEAD after a
crash, then autoanalyze is skipped and autovacuum happens only for the
anti-wrap case.

For the database case, rebuild_database_list() uses
pgstat_fetch_stat_dbentry() three times, discards entirely databases
that have no stats. Again, there should be a stats entry post a
crash upon a reconnection.

So there's an argument in saying that the situation does not get worse
here and that we actually may improve odds by keeping a trace of the
previous stats, no? In most cases, there would be a stats entry
post-crash as an INSERT or a scan would have re-created it, but the
stats would reflect the state registered since the last crash recovery
(even on HEAD, a bulk delete bumping the dead tuple counter would not
be detected post-crash). The case of spiky workloads may impact the
decision-making, of course, but at least we'd allow autovacuum to take
some decision over giving up entirely based on some previous state of
the stats. That sounds like a win for me with steady workloads, less
for bulby workloads..

> It'd not be such an issue if we updated stats during recovery, but I
> think think we're doing that. Perhaps we should, which might also help
> on replicas - no idea if it's feasible, though.

Stats on replicas are considered an independent thing AFAIU (scans are
counted for example in read-only queries). If we were to do that we
may want to split stats handling between nodes in standby state and
crash recovery. Not sure if that's worth the complication. First,
the stats exist at node-level.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2024-07-05 04:57:26 Re: Use generation memory context for tuplestore.c
Previous Message Peter Smith 2024-07-05 04:15:36 Re: Logical Replication of sequences