Re: Autovacuum not keeping up. (PG 9.2.9)

From: jesper(at)krogh(dot)cc
To: jesper(at)krogh(dot)cc
Cc: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject: Re: Autovacuum not keeping up. (PG 9.2.9)
Date: 2014-08-01 09:11:43
Message-ID: ea5b920abfccddd0707920001041d0af.squirrel@shrek.krogh.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

>> jesper(at)krogh(dot)cc wrote:
>>> Hi.
>>>
>>> I have a large database with a message queue table, that has high
>>> activity. The database supports 1-300 client connection concurrently,
>>> having transactions open in up to 30 minutes each.
>>>
>>> Recently I am seeing autuvacuum being issued, but it takes
>>> ages to get through the message queue table, with strace showing
>>> waiting
>>> for semop's for 10's to 100's of seconds.
>>
>> Do you have data on how relfrozenxid advances for that table?
>
> Not really, how would you normally pick that out?
> 2014-08-01 10:49:39.171 db=# select relname, age(relfrozenxid),
> relfrozenxid FROM pg_class WHERE relkind = 'r' and relname = 'job';
> relname | age | relfrozenxid
> ---------+-----------+--------------
> job | 111893622 | 796259097
> (1 row)
>
> Time: 1.913 ms
>
> This shouldn't qualify for a freeze vacuum, should it?
>
>> Vacuuming needs to grab a "cleanup lock" on each page it's going to
>> vacuum, which is a special kind of lock that requires that no other
>> process is even looking at the page at that moment (we call this "to
>> have the page pinned"), which is even weaker than having a shared lock
>> on the page. If traffic to some pages is high, it might be difficult
>> for vacuum to acquire this.
>
> Based on strace output this looks excactly like whats going on, there are
> some activity, then it pauses for some time then continues.

Very "scientifically" i have captured this strace output of the autovacuum
pid. This is a 10 minute trace (600s), total waiting time for
semop(2850903 in those minutes are not less than 485s, (last semop not
counted in).

jk(at)srv01:~$ sudo timelimit -t 600 -T 10 strace -T -p 36839 2>&1 | grep -v
0.000
Process 36839 attached - interrupt to quit
semop(2850903, {{11, -1, 0}}, 1) = 0 <14.403021>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001100>
write(367, "\\D\1\0\30p\325\37\1\0\0\0t\6\200\6\360\37\4 \0\0\0\0\340\237
\0\200\206 \0"..., 8192) = 8192 <0.002371>
semop(2850903, {{11, -1, 0}}, 1) = 0 <38.871837>
semop(2850903, {{11, -1, 0}}, 1) = 0 <17.111822>
semop(2850903, {{11, -1, 0}}, 1) = 0 <22.438373>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001095>
semop(2850903, {{11, -1, 0}}, 1) = 0 <32.521132>
semop(2261061, {{8, 1, 0}}, 1) = 0 <0.001601>
semop(2850903, {{11, -1, 0}}, 1) = 0 <10.740716>
semop(1179684, {{9, 1, 0}}, 1) = 0 <0.001900>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001106>
semop(2850903, {{11, -1, 0}}, 1) = 0 <26.610242>
semop(2850903, {{11, -1, 0}}, 1) = 0 <21.911487>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001060>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.001694>
semop(2850903, {{11, -1, 0}}, 1) = 0 <128.996370>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001390>
semop(2850903, {{11, -1, 0}}, 1) = 0 <20.553280>
read(367, "\\D\1\0H\2107\357\1\0\0\0|\3`\22\360\37\4 \0\0\0\0\340\237
\0\320\237 \0"..., 8192) = 8192 <0.001670>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001357>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.001046>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.001352>
semop(2850903, {{11, -1, 0}}, 1) = 0 <44.311165>
semop(2850903, {{11, -1, 0}}, 1) = 0 <79.246222>
read(367, "]D\1\0\10\"\3-\1\0\0\0|\3`\22\360\37\4 \0\0\0\0\340\237
\0\320\237 \0"..., 8192) = 8192 <0.001124>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001093>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001098>
semop(2850903, {{11, -1, 0}}, 1) = 0 <8.440753>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.001390>
semop(2850903, {{11, -1, 0}}, 1) = 0 <27.041010>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.001679>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.004053>
semop(2850903, {{11, -1, 0}}, 1) = 0 <9.158556>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.001015>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001113>
write(367, "]D\1\0\360#\177x\1\0\0\0\344\4\300\f\360\37\4 \0\0\0\0\340\237
\0\300\214 \0"..., 8192) = 8192 <0.001474>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.002450>
semop(655380, {{7, 1, 0}}, 1) = 0 <0.002245>
read(367, "]D\1\0\270\311\247\221\1\0\0\0D\5(at)\v\360\37\4 \0\0\0\0\340\237
\0(at)\213 \0"..., 8192) = 8192 <0.001339>
semop(2850903, {{11, -1, 0}}, 1) = 0 <0.002029>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001078>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001079>
semop(2850903, {{11, -1, 0}}, 1timelimit: sending warning signal 15
<unfinished ...>
Process 36839 detached
jk(at)srv01:~$

Is there any clue in that the semaphorenumber it waits on is excactly the
same all the time? (the <> to the right are timings in seconds.

--
Jesper

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Rural Hunter 2014-08-01 12:37:18 [Help needed] Update query hangs forever on high load
Previous Message jesper 2014-08-01 08:59:36 Re: Autovacuum not keeping up. (PG 9.2.9)