From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?
Date: 2021-12-07 15:58:02
Message-ID: CA+TgmoYemQxRN7EC+2UD6Lp0g9qu02g9SFZ_wECsBaB1NibFuQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 6, 2021 at 9:42 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Mon, Dec 6, 2021 at 6:11 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > This doesn't seem convincing. Launching autovacuum too soon surely has
> > costs that someone might not want to pay. Clearly in the degenerate
> > case where we always autovacuum every table every time an autovacuum
> > worker is launched, we have gone insane.
>
> Unfortunately we already sometimes behave insanely in exactly the way
> that you describe:
>
> https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=ww@mail.gmail.com

In the same ballpark is http://rhaas.blogspot.com/2020/02/useless-vacuuming.html

> It's hard to define a break-even point for launching an autovacuum
> worker. I think it would be more productive to come up with a design
> that at least doesn't go completely off the rails in various specific
> ways.

I think that's a good observation. I think the current autovacuum
algorithm works pretty well when things are normal. But in extreme
scenarios it does not degrade gracefully. The
vacuuming-over-and-over-without-doing-anything phenomenon is an
example of that. Another example is the user who creates 10,000
databases: we're happy to divide the 60s autovacuum_naptime by
10,000 and try to launch a worker every 6ms. A third example is
vacuuming the tables from pg_class in physical order on disk, so that
a table that is just 1 XID past the wraparound limit can impose a
long delay on vacuuming a table that is bloating quickly, or,
conversely, a table that is bloating very slowly but has just crawled
over the threshold for a regular vacuum gets processed before one
that is threatening an imminent wraparound shutdown. I think these are all
pathological cases that a well-informed human can easily recognize and
handle in an intelligent manner, and it doesn't seem crazy to program
those responses into the computer in some way.
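
To make that concrete, one such response would be to sort the
candidate tables by urgency instead of walking pg_class in physical
order. Here is a rough sketch of a comparator for that; the
AvCandidate struct and its fields are invented for illustration, and
the XID comparisons ignore the circularity of the XID space for
brevity:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct AvCandidate
    {
        TransactionId relfrozenxid;   /* oldest unfrozen XID in the table */
        double        bloat_ratio;    /* dead tuples relative to table size */
    } AvCandidate;

    static TransactionId wraparound_limit;   /* computed elsewhere */

    /* qsort() comparator: tables past the wraparound limit sort first,
     * oldest relfrozenxid first; everything else sorts by bloat. */
    static int
    av_priority_cmp(const void *a, const void *b)
    {
        const AvCandidate *ca = a;
        const AvCandidate *cb = b;
        int   wrap_a = ca->relfrozenxid < wraparound_limit;
        int   wrap_b = cb->relfrozenxid < wraparound_limit;

        if (wrap_a != wrap_b)
            return wrap_b - wrap_a;          /* wraparound danger wins */
        if (wrap_a)                          /* both in danger: oldest first */
            return (ca->relfrozenxid > cb->relfrozenxid) -
                   (ca->relfrozenxid < cb->relfrozenxid);
        /* neither in danger: most bloated first */
        return (cb->bloat_ratio > ca->bloat_ratio) -
               (cb->bloat_ratio < ca->bloat_ratio);
    }

That at least guarantees that a table threatening an imminent
shutdown gets processed before one that merely crossed the regular
vacuum threshold, whatever their positions on disk.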

> I also think that our problem is not so much that we don't have
> accurate statistics about dead tuples (though we surely don't have
> much accuracy). The main problem seems to be that there are various
> specific, important ways in which the distribution of dead tuples may
> matter (leading to various harmful biases). And so it seems reasonable
> to fudge how we interpret dead tuples with the intention of capturing
> some of that, as a medium term solution. Something that considers the
> concentration of dead tuples in heap pages seems promising.

I am less convinced about this part. It sort of sounds like you're
saying - it doesn't really matter whether the numbers we gather are
accurate, just that they produce the correct results. If the
information we currently gather wouldn't produce the right results
even if it were fully accurate, that to me suggests that we're
gathering the wrong information, and we should gather something else.
For example, we could attack the useless-vacuuming problem by having
each vacuum figure out - and store - the oldest XID that could
potentially be worth using as a cutoff for vacuuming that table, and
refusing to autovacuum that table again until we'd be able to use a
cutoff >= that value. I suppose this would be the oldest of (a) any
XID that exists in the table on a tuple that we judged recently dead,
(b) any XID that is currently running, and (c) the next XID.
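
In pseudo-C, that bookkeeping might look something like this; the
function names are placeholders for wherever this information would
actually come from, and again real code would need wraparound-aware
XID comparisons:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Placeholders for the three sources named above. */
    extern TransactionId oldest_recently_dead_xid(void);  /* (a) */
    extern TransactionId oldest_running_xid(void);        /* (b) */
    extern TransactionId next_xid(void);                  /* (c) */

    static TransactionId
    xid_min(TransactionId x, TransactionId y)
    {
        return (x < y) ? x : y;
    }

    /* Stored by each vacuum: vacuuming this table again is useless
     * until we can use a cutoff >= this value. */
    TransactionId
    min_useful_cutoff(void)
    {
        TransactionId oldest = oldest_recently_dead_xid();

        oldest = xid_min(oldest, oldest_running_xid());
        oldest = xid_min(oldest, next_xid());
        return oldest;
    }

    /* Checked by autovacuum before launching another vacuum. */
    bool
    worth_vacuuming(TransactionId achievable_cutoff,
                    TransactionId stored_min_useful)
    {
        return achievable_cutoff >= stored_min_useful;
    }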

I also accept that knowing the concentration of dead tuples on pages
that contain at least one dead tuple could be interesting. I've felt
for a while that it's a mistake to know how many dead tuples there are
but not how many pages contain them, because that affects both the
amount of I/O required to vacuum and how much work there is to do
setting VM bits. I'm not sure I would have approached gathering that information
in the way that you're proposing here, but I'm not deeply against it,
either. I do think we should keep it as a principle that whatever we
gather, we try our best to make accurate. If that doesn't work well,
then we should gather different stuff instead.
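
For instance, ANALYZE's page sampling could accumulate something like
the following alongside the existing dead-tuple count (a hypothetical
struct, not the current pgstat counters):

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct DeadTupleStats
    {
        uint64_t    dead_tuples;       /* total dead tuples seen */
        BlockNumber pages_with_dead;   /* pages containing >= 1 of them */
    } DeadTupleStats;

    /* Called once per sampled heap page. */
    static void
    accumulate_page(DeadTupleStats *stats, int ndead_on_page)
    {
        if (ndead_on_page > 0)
        {
            stats->dead_tuples += (uint64_t) ndead_on_page;
            stats->pages_with_dead++;
        }
    }

The ratio dead_tuples / pages_with_dead then says how concentrated
the garbage is: a high ratio means vacuum reclaims a lot per page it
visits, while a low ratio means a lot of I/O for comparatively little
benefit.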

--
Robert Haas
EDB: http://www.enterprisedb.com
