Re: Speeding up an in-progress wraparound-preventing vacuum

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Vincent de Phily <vincent(dot)dephily(at)mobile-devices(dot)fr>
Cc: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Speeding up an in-progress wraparound-preventing vacuum
Date: 2014-12-08 18:17:37
Message-ID: CAMkU=1wzZvnEyt+sAFkMFE1ms0861KKQ4eiYMM7Ae7bhP6zFYQ@mail.gmail.com
Lists: pgsql-general

On Mon, Dec 8, 2014 at 4:54 AM, Vincent de Phily <
vincent(dot)dephily(at)mobile-devices(dot)fr> wrote:

> Hi List,
>
> I have a "autovacuum: VACUUM ANALYZE public.some_big_table (to prevent
> wraparound)" that has been running for over 13 days. The process is
> consuming
> IO so I'm confident it isn't stuck, but it's still taking surprisingly
> long.
> PG 9.1.13 on Debian.
>
> The actual table is 584G on a SAN, plus 324G of indexes on local disk. The
> system was IO-starved until about 5 days ago, after offloading half the
> work
> to a different server and waiting for the purging of old data (went from
> keeping 4 weeks to 2) to catch up (so probably 2/3rd of the table is
> currently
> bloat, which I'd like to get back asap). Currently about 80% of the IO is
> devoted to the vacuum process (on average throughout the day, as
> extrapolated
> from atop output).
>

Is that 80% of the actually occurring IO, or 80% of the maximum possible IO?

>
> I've tried raising autovacuum_vacuum_cost_limit from 500 to 5000, but
> apparently the already-running vacuum process didn't pick up the change (I
> did
> send a sighup, and new vacuum processes do run faster). I tried running a
> manual vacuum on that table (to benefit from the new settings and from the
> more aggressive behavior of manual vacuums), but it's apparently waiting
> for
> the wraparound vacuum to finish first.
>
> My currrent settings:
> autovacuum | on | default
> autovacuum_analyze_scale_factor | 0.1 | default
> autovacuum_analyze_threshold | 10000 | configuration file
> autovacuum_freeze_max_age | 200000000 | default
> autovacuum_max_workers | 3 | default
> autovacuum_naptime | 60 | default
> autovacuum_vacuum_cost_delay | 10 | configuration file
> autovacuum_vacuum_cost_limit | 5000 | configuration file
> autovacuum_vacuum_scale_factor | 0.002 | configuration file
> autovacuum_vacuum_threshold | 10000 | configuration file
> maintenance_work_mem | 524288 | configuration file
> The vacuum_scale_factor is tweaked to trigger once a day for most tables,
> and
> should trigger every 3 days on the problem table.
>

Increasing maintenance_work_mem even further, at least temporarily and
locally for this operation, might be a good idea.
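For a manual run, the setting can be raised for a single session; a sketch (the table name is the thread's example, and the value is illustrative, not a recommendation):

```sql
-- Raise the memory cap for this session only, then vacuum the problem
-- table manually. '2GB' is illustrative; size it to available RAM.
SET maintenance_work_mem = '2GB';
VACUUM VERBOSE public.some_big_table;
```

A session-level SET avoids touching the server-wide value in postgresql.conf.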

> My first question is: is there a way to speed up the currently-running
> vacuum process? I guess killing it and letting it be replaced by a more
> aggressively-configured one would work, but I'd hate to lose 2 weeks of
> processing (is there a way to estimate how much vacuuming work remains
> to be done?), and I'm being a bit more cautious with the
> wraparound-preventing kind.
>

I don't know of a way to speed it up gracefully. That has frustrated me a
few times, and a solution would really be nice.

If you kill it, the new process will have to re-read the entire table, but
it will have much less work to do, since the killed process already 'plowed
a path' for it. In a sense, killing the process throws away all of the
sequential-read work already done on the table, but the index reads and the
write workload are not all lost; they will save the new process time.

You can `strace` for the lseek system call to see which file handles it is
currently working on, and use lsof to turn those into file names. You want
to look at where it is in the table files, not the index files.
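A sketch of that fd-to-filename mapping via /proc, demonstrated on a throwaway process; for the real case, substitute the autovacuum worker's PID (visible in pg_stat_activity), and `lsof -p $PID` gives the same information:

```shell
# Demo: map a process's open file descriptors to file names.
# For the real case, first watch which fds the vacuum is seeking in:
#   strace -p "$PID" -e trace=lseek
tail -f /etc/hostname & PID=$!
ls -l /proc/"$PID"/fd    # fd -> file name, same idea as `lsof -p $PID`
kill "$PID"
```

To tell table files from index files, `SELECT pg_relation_filepath('public.some_big_table');` (available on 9.0+) reports the table's base file path; large relations are split into 1 GB segments with .1, .2, ... suffixes.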

>
> Second question is: how come we reached the wraparound threshold on this
> table
> at all ? We've only been keeping 28 days of data in this table, doing
> incremental deletes every day (there are no updates at all). I find it very
> unlikely that we'd go through 2M transactions in that timespan (that'd need
> 890 transactions per second, we're well below 100). The
> pg_class.relfrozenxid
> on that table is at 680M, while most other tables are around 860M. Could
> it be
> that routine vacuums haven't been able to update the relfrozenxid in a long
> long time, or am I missing something else ?
>

PostgreSQL doesn't know what your delete routine is like. It has to verify
with its own eyes that there are no rows over a certain age.
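One way to see how far back each table's oldest unfrozen transaction ID goes (a standard catalog query, not specific to this thread):

```sql
-- Age, in transactions, of the oldest unfrozen XID per table.
-- Autovacuum forces a wraparound-preventing vacuum once this
-- approaches autovacuum_freeze_max_age (200M by default).
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;
```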

I don't think routine vacuums even attempt to update relfrozenxid, or at
least they don't try very hard.

Are you sure that routine vacuums have been running to completion on this
table, as opposed to getting interrupted by something before finishing each
time?
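The statistics views can help check that; a sketch (these columns are only updated when a vacuum completes, so an old timestamp on a busy table suggests runs are being interrupted):

```sql
-- When vacuum/autovacuum last *completed* on the table.
SELECT relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'some_big_table';
```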

Cheers,

Jeff
