From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alexey Bashtanov <bashtanov(at)imap(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Autovacuum fails to keep visibility map up-to-date in mostly-insert-only-tables
Date: 2014-10-20 20:11:14
Message-ID: 20141020201114.GG7176@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-19 20:43:29 -0500, Jim Nasby wrote:
> On 10/19/14, 11:41 AM, Andres Freund wrote:
> >On 2014-10-18 21:36:48 -0500, Jim Nasby wrote:
> >>The "weird" part is that if it's not doing a freeze it will just punt
> >>on a page if it can't get the cleanup lock.
> >
> >I don't think that's particularly weird. Otherwise vacuum can get stuck
> >behind a single very hot page - leading to much, much more bloat.
> >
> >>I have to believe that could seriously screw up autovacuum scheduling.
> >
> >Why?
>
> I'm worried there could be some pathological cases where we'd skip a
> large number of pages, perhaps if a vacuum scan and a seqscan ended up
> running alongside each other.

I've seen little evidence of that. The reverse, a stuck autovacuum, is
imo much more likely. For this to be an actual problem you'd need to
encounter many pages that are not locked, but are pinned. That state
doesn't exist for very long.
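
For reference, a rough sketch of the skip path, modelled on
vacuumlazy.c's lazy_scan_heap() (simplified, not the verbatim source):

    buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                             RBM_NORMAL, vac_strategy);
    if (!ConditionalLockBufferForCleanup(buf))
    {
        /*
         * Somebody else holds a pin on this page.  Unless we're
         * required to freeze everything (anti-wraparound), skip it
         * rather than stall behind it; the page stays !all-visible,
         * so a later vacuum will revisit it.
         */
        if (!scan_all)
        {
            ReleaseBuffer(buf);
            continue;
        }
        LockBufferForCleanup(buf);  /* freeze vacuum has to wait */
    }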

> Perhaps this is just paranoia, but we have no idea how bad things
> might be, because we don't have any logging for how many pages we
> skipped because we couldn't lock them.

But so what? If we skip individual pages it won't be too bad - and
waiting a long time for the lock is very likely going to be more
painful. The page won't be marked 'all visible', so the next vacuum
will come around to it again. And it'll also get cleaned up by
opportunistic HOT pruning.
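
A sketch of that opportunistic path, modelled on heapam's
heap_page_prune_opt() (simplified; the real function also applies a
free-space heuristic before bothering):

    /* Called on ordinary page accesses, e.g. from heap_fetch and
     * sequential/index scans; dead tuples get cleaned up without
     * waiting for vacuum. */
    if (PageIsPrunable(page, RecentGlobalXmin) &&
        ConditionalLockBufferForCleanup(buffer))
    {
        TransactionId ignore;

        heap_page_prune(relation, buffer, RecentGlobalXmin,
                        true, &ignore);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }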

> Also, if this really is that big a deal for heap pages, how come we
> don't get screwed by it on Btree index pages, where we mandate that we
> acquire a cleanup lock?

Because we never hold pins on btree pages for very long, whereas we do
for heap pages. If you e.g. run a cursor forward, you can hold a pin
for an essentially unbounded time.
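
A minimal sketch of how that happens: a heap scan keeps the page under
the current tuple pinned between calls (in scan->rs_cbuf), so a client
that fetches one row from a cursor and then goes away holds the pin
until the cursor moves off that page:

    /* The returned tuple points into the still-pinned buffer
     * scan->rs_cbuf; until the next fetch moves to another page
     * (or the cursor is closed), vacuum can't get a cleanup lock
     * on it. */
    HeapTuple tup = heap_getnext(scan, ForwardScanDirection);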

> >>Now that we have forks, I'm wondering if it would be best to come up
> >>with a per-page system that could be used to determine when a table
> >>needs background work to be done. The visibility map could serve a lot
> >>of this purpose, but I'm not sure if it would work for getting hint
> >>bits set in the background.
> >
> >It would. Per definition, all tuples that are 'all visible' need to be
> >fully hint bitted.
> >
> >>I think it would also be a win if we had a way to advance relfrozenxid
> >>and relminmxid. Perhaps something that simply remembered the last XID
> >>that touched each page...
> >
> >Not sure what you're getting at here?
>
> That ultimately, our current method for determining when and what to
> vacuum is rather crude, and likely results in wasted effort during
> scans as well as not firing autovac often enough. Keep in mind that
> autovac started as a user-space utility and the best it could possibly
> do was to keep a table of stats counters.

I agree that we should trigger autovacuum more often. It's
*intentionally* not triggered *at all* for insert-only workloads (if
you discount anti-wraparound vacuums). I think it's time to change
that. For that we'd need to make vacuums that don't delete any tuples
cheaper. We already rescan only the changed parts of the heap - but we
always scan indexes fully...
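
To illustrate, a simplified sketch of the relevant flow in
vacuumlazy.c (not verbatim): as soon as there are any dead tuples at
all, every index is scanned in its entirety, however few heap pages
were actually visited:

    if (vacrelstats->num_dead_tuples > 0)
    {
        /* Full scan of each index, independent of how much of the
         * heap the visibility map let us skip. */
        for (i = 0; i < nindexes; i++)
            lazy_vacuum_index(Irel[i], &indstats[i], vacrelstats);
        lazy_vacuum_heap(onerel, vacrelstats);
    }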

> The visibility map obviously helps cut down on extra work during a
> scan, but it only goes so far in that regard.

Aha.

> Instead of relying on the crude methods, if we reliably tracked
> certain txids on a per-block basis in a fork, we could cheaply scan
> the fork and make an extremely informed decision on how much a vacuum
> would gain us, and exactly what blocks it should hit.

> Let me use freezing as an example. If we had a reliable list of the
> lowest txid for each block of a relation that would allow us to do a
> freeze scan by hitting only blocks with minimum txid within our freeze
> range. The same could be done for multixacts.

It'd also become a prime contention point because you'd need to
constantly update it. In contrast to a simple 'is frozen' bit (akin to
is_visible) which only changes infrequently, and only in one direction.
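
To make the contrast concrete, a hypothetical freeze-map interface
mirroring visibilitymap.c (none of this exists; the names are made up
for illustration):

    /* During normal operation the bit is only ever *cleared*, which
     * is cheap and happens when the page is modified anyway; setting
     * it is vacuum's job.  A per-block xid, in contrast, would need
     * rewriting on every change to the underlying heap page. */
    extern bool freezemap_test(Relation rel, BlockNumber heapBlk,
                               Buffer *fmbuf);
    extern void freezemap_clear(Relation rel, BlockNumber heapBlk);
    extern void freezemap_set(Relation rel, BlockNumber heapBlk,
                              Buffer *fmbuf);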

> If we stored 3 txids for each block in a fork, we could fit
> information for ~680 heap blocks in each fork block. So in a database
> with 680G of heap data, we could fully determine every *block* (not
> table) we needed to vacuum by scanning just 1GB of data. That would
> allow for far better autovacuum scheduling than what we do today.

It's not that simple. Wraparounds and locking complicate it
significantly: stored xids grow old and would themselves have to be
updated before they wrap around, and keeping three xids per block
current means touching the fork, under a lock, on every change to the
underlying heap page.

> >I think the big missing piece lest something like Heikki's xid lsn
> >ranges thing gets finished is a freeze map.
>
> The problem with a simple freeze map is when do you actually set the
> bit?

There's precisely one place where you can set it in normal operation:
during vacuum's scan.
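
By analogy with how the all-visible bit is handled there, roughly (a
sketch, reusing the hypothetical freezemap_set() from above):

    /* Inside vacuum's heap scan, cleanup lock held, after every
     * tuple on the page has been examined and frozen if needed: */
    if (all_frozen)
        freezemap_set(onerel, blkno, &fmbuffer);  /* hypothetical */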

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
