Re: decoupling table and index vacuum

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: decoupling table and index vacuum
Date: 2021-04-21 23:51:35
Message-ID: CAH2-Wzk_h132i7bg5wD9qhiTj9i_q5GZzVBVvyQ2Fu_2=L=uWQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> We are used to thinking about table vacuum and index vacuum as parts
> of a single, indivisible operation. You vacuum the table -- among
> other things by performing HOT pruning and remembering dead TIDs --
> and then you vacuum the indexes -- removing the remembered TIDs from
> the index -- and then you vacuum the table some more, setting those
> dead TIDs unused -- and then you're done. And along the way you do
> some other things too, like considering truncation, that aren't relevant
> to the point I want to make here. Now, the problem with this is that
> every index has its own needs, which are separate from the needs of
> the tables, as I think Peter Geoghegan and Masahiko Sawada were also
> discussing recently.

I'm very happy to see that you've taken an interest in this work! I
believe it's an important area. It's too important to be left to only
two contributors. I welcome your participation as an equal partner in
the broader project to fix problems with VACUUM.

Masahiko and I have had plenty of ideas about where this could go next
-- way too many ideas, in fact. Maybe that kind of partnership sounds
unnecessary or at least seems premature, but it turns out that this
area is extremely broad and far reaching, if you really think it
through -- you end up having to negotiate rather a lot all at once.
Apart from anything else, I simply don't have the authority to commit
a bunch of stuff that implicitly makes Postgres do things a certain
way in a huge number of different subsystems. (Whether or not I'd be
right in each case is beside the point.)

My most ambitious goal is finding a way to remove the need to freeze
or to set hint bits. I think that we can do this by inventing a new
kind of VACUUM just for aborted transactions, which doesn't do index
vacuuming. You'd need something like an ARIES-style dirty page table
to make this cheap -- so it's a little like UNDO, but not very much.
The basic idea is that eagerly cleaning up aborted transactions in an
autovacuum worker allows you to broadly assume that most blocks
contain definitely-committed heap tuples, or else LP_DEAD stubs (which
of course don't contain any XIDs). You'd still have something like
conventional VACUUM, which wouldn't change that much. Freezing is
largely implicit, but maybe you freeze tuples the old way if and only
if a backend dirties a "known-all-committed" block -- that can still be
expensive.
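
Hand-waving quite a bit, the bookkeeping I have in mind is something
like this (every name is invented -- it's only meant to show the shape
of the idea, not a design):

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

/*
 * One entry per block that an aborted transaction may have left its
 * tuples on.  An autovacuum worker consumes these eagerly, so that
 * everybody else can assume that surviving heap tuples elsewhere come
 * from committed transactions.
 */
typedef struct AbortedXactDirtyPage
{
	TransactionId	xid;	/* the aborted transaction */
	RelFileNode		node;	/* relation it modified */
	BlockNumber		blkno;	/* block that may contain its tuples */
} AbortedXactDirtyPage;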

The visibility map still has an all-visible bit, but now it also has
an all-committed bit (or maybe it's a separate data structure). The
combination of all-visible and all-committed is precisely the same as
frozen, so you don't need a separate VM bit for that anymore.
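
To make that concrete at the level of the bits themselves, this is
roughly what I'm picturing (VISIBILITYMAP_ALL_COMMITTED is a made-up
name, of course):

/* What visibilitymap.h has today: */
#define VISIBILITYMAP_ALL_VISIBLE	0x01
#define VISIBILITYMAP_ALL_FROZEN	0x02

/*
 * Sketch: the second bit per heap block means "all tuples known
 * committed" instead (or that bit lives in a separate structure).
 * "Frozen" becomes the conjunction of the two bits, so no extra bit is
 * needed and BITS_PER_HEAPBLOCK stays at 2.
 */
#define VISIBILITYMAP_ALL_COMMITTED	0x02	/* invented */

#define VM_EFFECTIVELY_FROZEN(flags) \
	(((flags) & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_COMMITTED)) == \
	 (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_COMMITTED))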

Notice that this design doesn't change much about our basic approach
to transaction management. It just further decouples things.
Conventional VACUUMs are now only about garbage collection, and so can
be further optimized with that goal in mind. It's much easier to do
clever scheduling if VACUUM really only has to do garbage collection.

> Opportunistic index cleanup strategies like
> kill_prior_tuple and bottom-up deletion may work much better for some
> indexes than others, meaning that you could have some indexes that
> badly need to be vacuumed because they are full of garbage, and other
> indexes on the same table where the opportunistic cleanup has worked
> perfectly and there is no need for vacuuming at all.

I know I say this all the time these days, but it seems worth
repeating now: it is a qualitative difference, not a quantitative
difference. Bottom-up index deletion will frequently stop most indexes
on a table from growing by even one single block, while indexes that
cannot use the optimization (indexes that are logically modified by
UPDATE statements) might be hugely bloated. If this is the case during
one VACUUM operation, it's probably going to work like that with all
future VACUUM operations. It's abundantly clear that the current
quantitative approach just cannot be pushed much further.

> But, as things stand today, strategy options to deal with such
> situations are limited. Leaving aside what the code actually does
> right now, let's talk about what options we have in theory with the
> technology as it exists now. They basically all boil down to stopping
> early and then redoing the work later. We must always start with a
> pass over the heap to collect dead TIDs, because otherwise there's
> nothing else to do. Now we can choose to stop, but then the next
> VACUUM will have to collect all those TIDs again. It may get to skip
> more all-visible pages than the current vacuum did, but the pages that
> still have dead TIDs will all have to be visited again. If we don't
> stop, then we can choose to vacuum all of the indexes or just some of
> them, and then afterwards stop. But if we do this, the next VACUUM
> will have to scan all indexes again for the same TIDs. Here, we don't
> even have the visibility map to allow skipping known-clean pages, so
> it's *really* a lot of work we have to redo. Thus what we normally do
> is press on to the third step, where we mark dead line pointers unused
> after scanning every index in its entirety, and now they're gone and
> we don't have to worry about them again. Barring emergency escape
> valves, as things stand today, the frequency of table vacuuming is the
> same as the frequency of index vacuuming, even though the *required*
> frequency of vacuuming is not the same, and also varies from index to
> index.

I'm 100% in agreement about all of this.
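
To spell out the coupling in pseudocode (glossing over many details):

/*
 * Today's lazy VACUUM, heavily simplified:
 *
 * first heap pass:
 *     for each heap block:
 *         prune; remember dead TIDs in a memory array
 *         if the array fills up (maintenance_work_mem):
 *             vacuum every index in full
 *             mark those TIDs LP_UNUSED in the heap (second-pass work)
 *             forget the TIDs; resume scanning
 * vacuum every index in full
 * second heap pass: mark remaining remembered TIDs LP_UNUSED
 * maybe truncate the relation at the end
 */

The table work and the index work are welded together at every step.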

> Now, the reason for this is that when we discover dead TIDs, we only
> record them in memory, not on disk. So, as soon as VACUUM ends, we
> lose all knowledge of which TIDs were dead and must rediscover
> them. Suppose we didn't do this, and instead had a "dead TID" fork for
> each table.

I had a similar idea myself recently -- clearly remembering the TIDs
that you haven't vacuumed to save work later on makes a lot of sense.
I didn't get very far with it, even in my own head, but I like the
direction you're taking it. Having it work a little like a queue makes
a lot of sense.

> Suppose further that this worked like a conveyor belt,
> similar to WAL, where every dead TID we store into the fork is
> assigned an identifying 64-bit number that is never reused. Then,
> suppose that for each index, we store the number of the oldest entry
> that might still need to be vacuumed from the index. Every time you
> perform what we now call the first heap pass of a VACUUM, you add the
> new TIDs you find to the dead TID fork.

Maybe we can combine this known-dead-tid structure with the visibility
map. Index-only scans might be able to reason about blocks that are
mostly all-visible, but still have stub LP_DEAD line pointers that
this other structure knows about -- so you can get index-only scans
without requiring a full round of traditional vacuuming. Maybe there
is some opportunity like that, but I'm not sure how to fit it into
everything else right now.

> Every time you vacuum an
> index, the TIDs that need to be removed are those between the
> oldest-entry pointer for that index and the current end of the TID
> fork. You remove all of those and then advance your oldest-entry
> pointer accordingly. If that's too many TIDs to fit in
> maintenance_work_mem, you can just read as many as will fit and
> advance your oldest-entry pointer less far. Every time you perform
> what we now call the second heap pass of a VACUUM, you find all the
> TIDs that precede every index's oldest-entry pointer and set them
> unused. You then throw away the associated storage at the OS level.
> This requires a scheme where relations can be efficiently truncated
> from the beginning rather than only at the end, which is why I said "a
> conveyor belt" and "similar to WAL". Details deliberately vague since
> I am just brainstorming here.

This amounts to adding yet more decoupling -- which seems great to me.
Anything that gives us the option but not the obligation to perform
work either more lazily or more eagerly (whichever makes sense) seems
highly desirable to me. Especially if we can delay our decision until
the last possible point, when we can have relatively high confidence
that we know what we're doing. And especially if holding it open as an
option is pretty cheap (that's the point of remembering dead TIDs).
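
Just to make sure that we're picturing the same thing, here is roughly
how I imagine the bookkeeping working (all invented names -- not a
proposal for the actual data structures):

#include "postgres.h"
#include "storage/itemptr.h"

/* Identity on the conveyor belt: 64-bit, assigned once, never reused */
typedef uint64 DeadTidNumber;

/* One entry appended to the dead-TID fork by the first heap pass */
typedef struct DeadTidEntry
{
	DeadTidNumber	num;
	ItemPointerData	tid;
} DeadTidEntry;

/*
 * Per-index progress marker: everything on the belt before this point
 * has already been removed from this index.
 */
typedef struct IndexVacuumCursor
{
	Oid				indexoid;
	DeadTidNumber	oldest_unremoved;
} IndexVacuumCursor;

/*
 * Index vacuuming for index i removes the TIDs in
 * [cursor[i].oldest_unremoved, belt insert position), in
 * maintenance_work_mem-sized chunks if need be, then advances the
 * cursor.  The second heap pass can set LP_UNUSED any TID numbered
 * below the minimum of all the cursors, and then truncate the belt
 * from the front.
 */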

> Furthermore, all of these operations can start in any order, and any
> of them can be repeated any number of times during a single run of any
> particular other one, or indeed, without that particular one ever
> being run at all. Both heap phases can also now be done in smaller
> chunks, if desired.

> But is this worthwhile? I think it depends a lot on what you think the
> comparative required frequencies are for the various operations.

There is a risk that you'll never think that any optimization is worth
it because each optimization seems marginal in isolation. Sometimes a
diversity of strategies is the real strategy. Let's say you have a
bunch of options that you can pick and choose from, with fallbacks and
with ways of course correcting even halfway through the VACUUM. It's
possible that that will work wonderfully well for a given complex user
workload, but if you subtract away *any one* of the strategies
suddenly things get much worse in some obvious high-level way. It's
entirely possible for a single table to have different needs in
different parts of the table.

Certainly works that way with indexes -- that much I can say for sure.

> If index A needs to be vacuumed every 40 minutes and index B needs to be
> vacuumed every 55 minutes and the table that owns both of them needs
> to be vacuumed every 70 minutes, I am not sure there is a whole lot
> here. I think you will be pretty much equally well off if you just do
> what we do today every 40 minutes and call it good.

That's probably all true, but I think that an excellent heuristic is
to work hard to avoid really terrible outcomes, rather than trying
hard to get good outcomes. The extremes don't just matter -- they may
even be the only thing that matters.

If index A needs to be vacuumed about as frequently as index B anyway,
then the user happens to naturally be in a position where the current
simplistic scheduling works for them. Which is fine, as far as it
goes, but that's not really where we have problems. If you consider
the "qualitative, not quantitative" perspective, things change. It's
now pretty unlikely that all of the indexes on the same table will
have approximately the same needs -- except when there is very little
index work to do in the first place, which isn't the interesting case.
Workloads generally don't logically modify every indexed column within
each UPDATE; they just don't tend to look like that in practice.

> Also, you will not
> benefit very much if the driving force is reclaiming dead line
> pointers in the table itself. If that has to happen frequently, then
> the indexes have to be scanned frequently, and this whole thing is a
> lot of work for not much. But, maybe that's not the case. Suppose
> index A needs to be vacuumed every hour to avoid bloat, index B needs
> to be vacuumed every 4 hours to avoid bloat, and the table needs dead
> line pointers reclaimed every 5.5 hours. Well, now you can gain a lot.
> You can vacuum index A frequently while vacuuming index B only as
> often as it needs, and you can reclaim dead line pointers on their own
> schedule based on whatever index vacuuming was already done for bloat
> avoidance. Without this scheme, there's just no way to give everybody
> what they need without some of the participants being "dragged along
> for the ride" and forced into work that they don't actually need done
> simply because "that's how it works."

Right. And, the differences between index A and index B will tend to
be pretty consistent and often much larger than this.

Many indexes would never have to be vacuumed at all, even with non-HOT
UPDATEs, thanks to bottom-up index deletion -- they literally won't
have a single page split for hours, while maybe one other index gets
3x larger in the same timeframe. Eventually you'll need to vacuum all
of the indexes (not just the bloated one), but that's only required to
make heap vacuuming safe. That's not so bad, provided the non-bloated
indexes aren't dirtied much and it doesn't happen too frequently
(dirtying pages is the real cost to keep under control here).

> One thing I don't know is whether the kind of scenario that I describe
> above is common, i.e. is the main reason we need to vacuum to control
> index bloat, where this kind of approach seems likely to help, or is
> it to reclaim dead line pointers in the heap, where it's not? I'd be
> interested in hearing from people who have some experience in this
> area, or at least better intuition than I do.

The paradox here is:

1. Workload characteristics are important and must be exploited to get
optimal performance.

2. Workloads are too complicated and unpredictable to ever truly understand.

Roughly speaking, the solution that I think has the most promise is to
make it okay for your heuristics to be wrong. You do this by keeping
the costs simple, fixed and low, and by doing things that have
multiple benefits (not just one). This is why it's important to give
VACUUM a bunch of strategies that it can choose from and switch back
and forth between, with minimal commitment -- you let VACUUM figure out
what to do about the workload through trial and error. It has to try
and fail on occasion -- you must be willing to pay the cost of
negative feedback (though the cost must be carefully managed). This
approach is perhaps sufficient to cover all of the possible extremes
with all workloads. I think that the extremes are where our problems
all are, or close to it.

The cost shouldn't be terribly noticeable because you have the
flexibility to change your mind at the first sign of an issue. So you
never pay an extreme cost (you pay a pretty low fixed cost
incrementally, at worst), but you do sometimes (and maybe even often)
get an extreme benefit -- the benefit of avoiding current pathological
performance issues. We know that the cost of bloat is very non-linear
in a bunch of ways that can be pretty well understood, so that seems
like the right thing to focus on -- this is perhaps the only thing
that we can expect to understand with a relatively high degree of
confidence. We can live with a lot of uncertainty about what's going
on with the workload by managing it continually, ramping up and down,
etc.

> Clearly,
> we do not want to vacuum each partition by scanning the 1GB partition
> + the 50MB local index + the 50GB global index. That's insane. With
> the above system, since everything's decoupled, you can vacuum the
> partition tables individually as often as required, and similarly for
> their local indexes, but put off vacuuming the global index until
> you've vacuumed a bunch, maybe all, of the partitions, so that the
> work of cleaning up the global index cleans up dead TIDs from many or
> all partitions instead of just one at a time.

I can't think of any other way of sensibly implementing global indexes.

> Now, the fly in the ointment here is that this supposes that we don't
> get forced into vacuuming the global index too quickly because of dead
> line pointer accumulation. But, I think if that does happen, with
> careful scheduling, we might not really be worse off than we would
> have been without partitioning. If we scan the table for just one
> partition and, say, exhaust maintenance_work_mem, we have some
> expensive index vacuuming to do immediately, but that would've also
> happened in pretty much the same way with an unpartitioned table.

But you can at least drop the partitions with a global index. It
shouldn't be too hard to make that work without breaking things.

> It's probably tricky to get the
> autovacuum algorithm right here, but there seems to be some room for
> optimism.

I think that it's basically okay if global indexes suck when you do
lots of UPDATEs -- this is a limitation that users can probably live
with. What's not okay is if they suck when you do relatively few
UPDATEs. It's especially not okay to have to scan the global index to
delete one index tuple that points to one LP_DEAD item. You tend to
get a tiny number of LP_DEAD items even when the DBA bends over
backwards to make all UPDATEs use HOT -- getting that to happen 99%+
of the time is so much easier than getting it to happen 100% of the
time.
There can be enormous asymmetry with this stuff.

Long term, I see VACUUM evolving into something that can only be
configured in an advisory way. It's too hard to tune this stuff
because what we really want here is to structure many things as an
optimization problem, and to have a holistic view that considers how
the table changes over time -- over multiple VACUUM operations. We can
safely be very lazy if we have some basic sense of proportion about
what the risk is. For example, maybe we limit the number of newly
dirtied pages during VACUUM by being lazy about pruning pages that
don't happen to be dirty when encountered within VACUUM. We still have
some sense of how much work we've put off, so as to never get in over
our head with debt. We might also have a sense of how many dirty pages
in total there are in the system currently -- maybe if the DB is not
busy right now we can be extra aggressive. In short, we apply holistic
thinking.
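
To show the flavor of what I mean, a sketch (every name here is
invented, and the "is this buffer already dirty?" test would need a
small bufmgr addition that doesn't exist today):

#include "postgres.h"
#include "storage/bufmgr.h"

/* Invented -- the accounting only */
typedef struct VacuumDirtyBudget
{
	int64	max_newly_dirtied;	/* self-imposed cap for this VACUUM */
	int64	newly_dirtied;		/* pages this VACUUM has dirtied so far */
	int64	pruning_deferred;	/* clean pages we declined to prune */
} VacuumDirtyBudget;

/* Hypothetical -- no such bufmgr function exists today */
extern bool BufferIsAlreadyDirty(Buffer buf);

static bool
should_prune_page(Buffer buf, VacuumDirtyBudget *budget)
{
	if (BufferIsAlreadyDirty(buf))
		return true;			/* pruning adds no new dirty page */

	if (budget->newly_dirtied < budget->max_newly_dirtied)
	{
		budget->newly_dirtied++;
		return true;			/* within budget, okay to dirty it */
	}

	/* Over budget: skip pruning here, but remember the deferred debt */
	budget->pruning_deferred++;
	return false;
}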

> Even if global indexes never happened, though, I think this could have
> other benefits. For example, the wraparound failsafe mechanism
> recently added by Masahiko Sawada and Peter Geoghegan bypasses index
> vacuuming when wraparound danger is imminent. The only problem is that
> making that decision means throwing away the accumulated list of dead
> TIDs, which then need to be rediscovered whenever we get around to
> vacuuming the indexes. But that's avoidable, if they're stored on disk
> rather than in RAM.

Yeah, that's not ideal.

> One rather serious objection to this whole line of attack is that we'd
> ideally like VACUUM to reclaim disk space without using any more, in
> case that's the motivation for running VACUUM in the first place. A
> related objection is that if it's sometimes agreeable to do everything
> all at
> once as we currently do, the I/O overhead could be avoided.

Of course it's possible that what we currently do will be optimal. But
it's pretty much a question of mostly-independent things all going the
same way. So I expect that it will be rare.

> I think
> we'd probably have to retain a code path that buffers the dead TIDs in
> memory to account, at least, for the low-on-disk-space case, and maybe
> that can also be used to avoid I/O in some other cases, too. I haven't
> thought through all the details here. It seems to me that the actual
> I/O avoidance is probably not all that much - each dead TID is much
> smaller than the deleted tuple that gave rise to it, and typically you
> don't delete all the tuples at once - but it might be material in some
> cases, and it's definitely material if you don't have enough disk
> space left for it to complete without error.

All true.
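
(For a sense of scale: a heap TID is 6 bytes, so even 10 million
remembered dead TIDs come to only about 60MB, before whatever
per-entry overhead the conveyor belt scheme itself adds.)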

--
Peter Geoghegan
