Re: decoupling table and index vacuum

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: decoupling table and index vacuum
Date: 2021-04-22 18:37:46
Message-ID: CA+TgmoZU6xAZ8hypQk9YJZt4WQFuouyBfSOuL+J7R4WkWGsQog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 22, 2021 at 7:51 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> How do we decide this target, I mean at a given point how do we decide
> that what is the limit of dead TID's at which we want to trigger the
> index vacuuming?

It's tricky. Essentially, it's a cost-benefit analysis. On the "cost"
side, the expense associated with an index vacuum is basically the
number of pages that we're going to visit, and the number of those
that we're going to dirty. We can know the former with certainty but
can only estimate the latter. On the "benefit" side, setting dead TIDs
unused helps us in two ways. First, it lets us mark heap pages
all-visible, which makes index-only scans work better and reduces the
cost of future vacuuming. These benefits are mitigated by future DML
unsetting those bits again; there's no point in marking a page
all-visible if it's about to be modified again. Second, it avoids line
pointer bloat. Dead line pointers still take up space on the page, and
potentially force the line pointer array to be extended, which can
eventually cause tuples that would have fit into the page to spill
into a different page, possibly a newly-allocated one that forces a
table extension.

It's hard to compare the costs to the benefits because we don't know
the frequency of access. Suppose it costs $1000 to vacuum relevant
indexes and set dead line pointers unused. And suppose that if you do
it, you thereafter will save $1 every time someone does an index-only
scan. If there will be >1000 index-only scans in a meaningful time
frame, it's a good trade-off, but if not, it's a bad one, but it's
difficult to predict the future, and we have limited information even
about the past.

My intuition is that two things that we want to consider are the total
number of dead line pointers in the heap, and the number of pages
across which they are spread. It is also my intuition that the latter
is the more important number, possibly to the extent that we could
ignore the former number completely. But exactly what the thresholds
should be is very unclear to me.

> Is it a good idea to always perform an I/O after collecting the dead
> TID's or there should be an option where the user can configure so
> that it aggressively vacuum all the indexes and this I/O overhead can
> be avoided completely.

It's my view that there should definitely be such an option.

As I also mentioned on another thread recently, suppose we pick words
for each phase of vacuum. For the sake of argument, suppose we refer
to the first heap phase as PRUNE, the index phase as SANITIZE, and the
second heap phase as RECYCLE. Then you can imagine syntax like this:

VACUUM (PRUNE) my_table;
VACUUM (SANITIZE) my_table; -- all indexes
VACUUM my_index; -- must be sanitize only
VACUUM (PRUNE, SANITIZE, RECYCLE) my_table; -- do everything

Now in the last case is clearly possible for the system to do
everything in memory since all phases are being performed, but
depending on what we decide, maybe it will choose to use the dead-TID
fork in some cases for some reason or other. If so, we might have
explicit syntax to override that behavior, e.g.

VACUUM (PRUNE, SANITIZE, RECYCLE, TID_STORE 0) my_table;

which might be able to be abbreviated, depending on how we set the
defaults, to just:

VACUUM (TID_STORE 0) my_table;

This is all just hypothetical syntax and probably needs a good deal of
polish and bike-shedding. But it would be really nice to standardize
on some set of terms like prune/sanitize/recycle or whatever we pick,
because then we could document what they mean, use them in autovacuum
log messages, use them internally for function names, use them for
VACUUM option names when we get to that point, etc. and the whole
thing would be a good deal more comprehensible than at present.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-04-22 18:44:00 Re: decoupling table and index vacuum
Previous Message Tom Lane 2021-04-22 18:37:26 Re: posgres 12 bug (partitioned table)