Re: Hung Vacuum in 8.3

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Hung Vacuum in 8.3
Date: 2011-03-08 03:14:42
Message-ID: 4D759F22.7020809@catalyst.net.nz
Lists: pgsql-bugs

On 08/03/11 13:03, Mark Kirkwood wrote:
> On 08/03/11 12:55, Mark Kirkwood wrote:
>> On 23/02/11 10:18, Mark Kirkwood wrote:
>>> On 23/02/11 00:26, Greg Stark wrote:
>>>>
>>>> It's also possible there's a bug of course. If someone was using that
>>>> buffer and somehow failed to notify the vacuum that they were done it
>>>> would wait for a very long time (forever?). However if vacuum
>>>> eventually continued when the query was canceled then it seems likely
>>>> it was working as intended.
>>>>
>>>
>>> Greg, thanks for clarifying this.
>>>
>>> Unfortunately this time around I canceled the vacuum and then the
>>> query. However *next* time I'll get rid of the query 1st and see
>>> what happens.
>>>
>>
>> I have another one here:
>>
>> postgres=# select datname,procpid, query_start, current_query
>> from pg_stat_activity
>> where now() - query_start > '2 hours'::interval
>> order by query_start;
>> datname | procpid | query_start | current_query
>> ---------+---------+-------------------------------+-----------------
>> stuff | 23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
>>
>> postgres=# select datname,procpid, query_start, current_query
>> from pg_stat_activity
>> where now() - backend_start > '2 hours'::interval
>> order by query_start;
>> datname | procpid | query_start | current_query
>> ---------+---------+-------------------------------+-----------------
>> stuff | 23957 | 2011-03-08 06:02:02.659159+13 | VACUUM ANALYZE;
>> :
>> stuff | 30018 | 2011-03-08 11:57:03.214177+13 | <IDLE>
>> stuff | 11390 | 2011-03-08 12:27:20.390122+13 | <IDLE>
>> stuff | 10178 | 2011-03-08 12:28:59.328265+13 | <IDLE>
>> stuff | 5100 | 2011-03-08 12:30:10.318204+13 | <IDLE>
>> stuff | 10255 | 2011-03-08 12:32:21.520814+13 | <IDLE>
>> stuff | 29993 | 2011-03-08 12:32:26.562835+13 | <IDLE>
>> stuff | 10204 | 2011-03-08 12:40:57.059958+13 | <IDLE>
>> stuff | 10206 | 2011-03-08 12:41:11.708924+13 | <IDLE>
>> (9 rows)
>>
>>
>> It looks to me like whatever query held the buffer pins the vacuum
>> *was* waiting on should be well and truly finished by now. The strace
>> for the vacuum looks the same as before:
>>
>> $ strace -p 23957
>> Process 23957 attached - interrupt to quit
>> semop(39747613, 0x7fff53d3dec0, 1
>>
>>
>> It certainly looks like the vacuum is not able to resume after
>> setting itself to sleep for some reason.
>>
>>
>>
>
> Immediately after I sent this, I saw the vacuum unlock itself and
> continue... so there *are* some queries still blocking it. I'm seeing
> some fairly long (1-2 minute) lock waits for a certain table:
>
> postgres=# select datname,procpid, query_start, current_query
> from pg_stat_activity where now() - query_start > '1 minute'
> and procpid in (select distinct pid from pg_locks where not
> granted) order by query_start;
> datname | procpid | query_start | current_query
> ---------+---------+-------------------------------+------------------------------
>
> stuff | 390 | 2011-03-08 12:57:01.752234+13 | UPDATE
> content.twitter_queue
> : SET error = $1
> : WHERE
> twitter_queue_id = $2
> (30 or so rows like this)
>
> which are probably not helping.
>
>

Make that definitely not helping:

stuff=# select relname, last_vacuum
from pg_stat_all_tables
where last_vacuum is not null
and last_vacuum > '2011-03-08 06:03:06'::timestamp
order by last_vacuum limit 3;
relname | last_vacuum
----------------+-------------------------------
headline_alert | 2011-03-08 06:03:06.182866+13
stock_static | 2011-03-08 06:03:06.610739+13
twitter_queue | 2011-03-08 12:58:03.02451+13
(3 rows)

So we were trying to vacuum 'twitter_queue' (a small 20MB table) for
approx 6 hrs.

Note that the vacuum itself was not waiting on any locks according to
pg_locks, so this looks to be the buffer pinning issue. Clearly we need
to amend the application design that results in this quite long (1-2
minute!) wait for the lock on this table.
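
For reference, checking that in pg_locks amounts to something like the
following (sketch only, using the 8.3 pg_locks columns):

-- every lock held or requested by the vacuum backend (pid 23957 here);
-- if all rows show granted = t, it is not blocked on a heavyweight
-- lock, so the wait must be something else - here, a buffer pin.
select locktype, relation::regclass as relation, mode, granted
from pg_locks
where pid = 23957;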

regards

Mark

P.S: for the interested, the process(es) that the above UPDATEs are
waiting for are sitting '<IDLE> in transaction' - i.e. the application
end is doing some *non-database* blocking call whilst holding a lock on
the 'twitter_queue' table :-/
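
P.P.S: for anyone wanting to chase this sort of thing down themselves, a
rough sketch of the kind of pg_locks self-join that maps waiting backends
to the ones holding the conflicting locks (8.3 catalog column names, i.e.
procpid/current_query - treat it as a starting point, not tested as
written):

-- pair each ungranted lock request with the granted lock(s) on the
-- same object held by some other backend, and show what that backend
-- is currently doing (in our case: '<IDLE> in transaction').
select w.pid  as waiting_pid,
       h.pid  as holding_pid,
       ha.current_query as holding_query
from pg_locks w
join pg_locks h
  on  h.locktype = w.locktype
  and h.database      is not distinct from w.database
  and h.relation      is not distinct from w.relation
  and h.page          is not distinct from w.page
  and h.tuple         is not distinct from w.tuple
  and h.virtualxid    is not distinct from w.virtualxid
  and h.transactionid is not distinct from w.transactionid
  and h.classid       is not distinct from w.classid
  and h.objid         is not distinct from w.objid
  and h.objsubid      is not distinct from w.objsubid
  and h.pid <> w.pid
join pg_stat_activity ha on ha.procpid = h.pid
where not w.granted
  and h.granted;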
