From: Lonni J Friedman <netllama(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: autovacuum stuck on a table for 18+ hours, consuming lots of CPU time
Date: 2011-11-23 02:20:25
Message-ID: CAP=oouFigQF+03dAdvj2LBuzYa_K+i8M4cXJJ=1j-FQVwU1e0A@mail.gmail.com

Greetings,
I'm running PostgreSQL 9.0.4 on a Linux x86_64 cluster with one master
and two streaming-replication slaves. Since late yesterday, the load
on the server has been noticeably higher (5.00+) than normal (generally
under 1.00). I investigated and found that a single autovacuum process
has been running for the past ~18 hours without making any obvious
progress:

select procpid,query_start,current_query from pg_stat_activity where
current_query LIKE 'autovacuum%' ;
 procpid |          query_start          |             current_query
---------+-------------------------------+------------------------------------------
   30188 | 2011-11-21 22:42:26.426315-08 | autovacuum: VACUUM public.nppsmoketests
(1 row)
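
For what it's worth, pg_stat_activity in 9.0 also exposes a boolean
"waiting" column, so a quick check like this (a minimal sketch against
the same PID) should show whether the backend is blocked on a lock
rather than actually doing work:

select procpid, waiting, query_start
from pg_stat_activity
where procpid = 30188;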

select c.oid,c.relname,l.pid,l.mode,l.granted from pg_class c join
pg_locks l on c.oid=l.relation where l.pid='30188' order by l.pid;
  oid  |              relname               |  pid  |           mode           | granted
-------+------------------------------------+-------+--------------------------+---------
 72112 | nppsmoketests                      | 30188 | ShareUpdateExclusiveLock | t
 72617 | nppsmoketests_pkey                 | 30188 | RowExclusiveLock         | t
 72619 | nppsmoketests_bug_idx              | 30188 | RowExclusiveLock         | t
 72620 | nppsmoketests_active_idx           | 30188 | RowExclusiveLock         | t
 72621 | nppsmoketests_arch_idx             | 30188 | RowExclusiveLock         | t
 72622 | nppsmoketests_branch_idx           | 30188 | RowExclusiveLock         | t
 72623 | nppsmoketests_current_status_idx   | 30188 | RowExclusiveLock         | t
 72624 | nppsmoketests_build_type_idx       | 30188 | RowExclusiveLock         | t
 72625 | nppsmoketests_gpu_idx              | 30188 | RowExclusiveLock         | t
 72626 | nppsmoketests_os_idx               | 30188 | RowExclusiveLock         | t
 72627 | nppsmoketests_owner_idx            | 30188 | RowExclusiveLock         | t
 72628 | nppsmoketests_regressioncl_idx     | 30188 | RowExclusiveLock         | t
 72629 | nppsmoketests_subtest_idx          | 30188 | RowExclusiveLock         | t
 72630 | nppsmoketests_suiteid_idx          | 30188 | RowExclusiveLock         | t
 72631 | nppsmoketests_suiteid_testname_idx | 30188 | RowExclusiveLock         | t
 72632 | nppsmoketests_testcl_idx           | 30188 | RowExclusiveLock         | t
 72633 | nppsmoketests_testname_idx         | 30188 | RowExclusiveLock         | t
 80749 | nppsmoketests_osversion_idx        | 30188 | RowExclusiveLock         | t
(18 rows)
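
To rule out some other session holding a conflicting lock on the same
relation, something along these lines (a minimal sketch reusing the
table's oid from above) should list any other holders or waiters:

select pid, locktype, mode, granted
from pg_locks
where relation = 72112 and pid <> 30188;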

When I strace PID 30188, I see tons of output like the following
scrolling past quickly. I'm not really sure what it means, beyond the
repeated 'Timeout' results not looking good:
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 753901568, SEEK_SET) = 753901568
read(95, "\202\1\0\0\260\315\250\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
lseek(95, 753917952, SEEK_SET) = 753917952
read(95, "\202\1\0\0 N\253\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 768606208, SEEK_SET) = 768606208
read(95, "\204\1\0\0h!~\233\1\0\0\0\230\0\360\20\360\37\4 \0\0\0\0x\237\360\0\0\237\360\0"..., 8192) = 8192
lseek(95, 753934336, SEEK_SET) = 753934336
read(95, "\202\1\0\0 &\275\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
read(95, "\202\1\0\0\10\33\276\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
lseek(95, 753958912, SEEK_SET) = 753958912
read(95, "\202\1\0\0x\317\307\245\1\0\0\0\220\0\360\20\360\37\4 \0\0\0\0p\237\0\1\360\236\0\1"..., 8192) = 8192
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
lseek(95, 768614400, SEEK_SET) = 768614400
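
Those reads land at offsets of roughly 720-735 MB into whatever file
descriptor 95 refers to. Comparing that against the sizes of the table
and its indexes might at least show how far through the relation it is;
a sketch using pg_relation_size (available in 9.0):

select relname, pg_size_pretty(pg_relation_size(oid)) as size
from pg_class
where relname like 'nppsmoketests%'
order by pg_relation_size(oid) desc;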

An old thread suggests that this is a stuck spinlock:
http://archives.postgresql.org/pgsql-performance/2009-05/msg00455.php

I'm using the defaults for all the *vacuum* options in
postgresql.conf, except for:
log_autovacuum_min_duration = 2500
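
The repeated ~32 ms select() sleeps in the strace output also look like
they could be cost-based vacuum delay rather than a hung lock; if so,
the relevant settings can be dumped with something like this (a minimal
sketch):

select name, setting, unit
from pg_settings
where name like '%vacuum_cost%' or name like 'autovacuum%';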

At this point, I'm not sure what I can safely do to either terminate
this autovacuum process or kick it into making progress again.
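
The obvious candidate seems to be pg_cancel_backend(), which just sends
SIGINT; my understanding is that autovacuum handles that cleanly and the
table simply gets retried on a later cycle, though I'd welcome
confirmation that it's safe here (a sketch, not something I've run yet):

select pg_cancel_backend(30188);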

thanks
