Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`

From: Brian Ghidinelli <brian(at)vfive(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
Date: 2016-02-18 18:32:49
Message-ID: F7A6B776-6DD9-466E-822D-C28CBC3A8A2B@vfive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Álvaro Herrera wrote:
> This smells like it's looping waiting for a multixact to be fully
> written out ... except that the uninterruptibility part of that was fixed in
> time for 9.4.0,
>
> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> Branch: master Release: REL9_5_BR [51f9ea25d] 2014-11-14 15:14:01 -0300
> Branch: REL9_4_STABLE Release: REL9_4_0 [137e4da6d] 2014-11-14 15:14:02 -0300
> Branch: REL9_3_STABLE Release: REL9_3_6 [d45e8dc52] 2014-11-14 15:14:02 -0300
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=137e4da6d
>
>
> Can you attach to the looping process with gdb when it's doing the
> select() dance, and obtain a backtrace? You need debug symbols
> installed.

Alvaro, thanks for the quick response. I’ve done some additional research and found your involvement on other threads in 2014 and 2015 that have allowed me to collect more supporting details. In a nutshell, it appears to be multixid related and /maybe/ pg_upgrade related; I don’t know if we used it on the 9.3 branch or not. I don’t have the debug symbols but I did attach gdb and could see it was working on multixid. I don’t have the ability to compile from source on this production server unfortunately but I’m happy to do whatever else I can to help the team find a solution.

First, I checked the pg_upgrade changelog note in 9.4.4 and here’s what I found in my install:

* pg_controldata DOES report “Latest checkpoint’s oldestMultiXid” = 1
* There IS a file named 0000 in PGDATA/pg_multixact/offsets

pg_controldata:

# /usr/pgsql-9.4/bin/pg_controldata ~postgres/9.4/data/
pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6123001933830237052
Database cluster state: in production
pg_control last modified: Thu 18 Feb 2016 09:53:29 AM PST
Latest checkpoint location: 130/6FBD9E58
Prior checkpoint location: 130/6F9ED950
Latest checkpoint's REDO location: 130/6FAEEFC0
Latest checkpoint's REDO WAL file: 00000001000001300000006F
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/90358137
Latest checkpoint's NextOID: 5523460
Latest checkpoint's NextMultiXactId: 84342
Latest checkpoint's NextMultiOffset: 167545
Latest checkpoint's oldestXID: 654
Latest checkpoint's oldestXID's DB: 16416
Latest checkpoint's oldestActiveXID: 90358137
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16416
Time of latest checkpoint: Thu 18 Feb 2016 09:52:31 AM PST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current wal_log_hints setting: off
Current max_connections setting: 100
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

the autovacuum log status:

# select pg_class.relname, pg_locks.transactionid, pg_locks.mode,
pg_locks.granted as "g", left(pg_stat_activity.query, 60) as query,
pg_stat_activity.query_start,
age(now(),pg_stat_activity.query_start) as "age",
pg_stat_activity.pid
from pg_stat_activity,pg_locks
left outer join pg_class on (pg_locks.relation = pg_class.oid)
where pg_locks.pid=pg_stat_activity.pid
and pg_stat_activity.pid = 14578
order by query_start;

relname | transactionid | mode | g | query | query_start | age | pid
-----------------------------------------+---------------+--------------------------+---+-----------------------------------------------+-------------------------------+-----------------+-------
idx_clubmember_memberid | | RowExclusiveLock | t | autovacuum: VACUUM public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578
idx_lookupclubmember_uidmember | | RowExclusiveLock | t | autovacuum: VACUUM public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578
idx_tbllookupclubmember_uidclubofrecord | | RowExclusiveLock | t | autovacuum: VACUUM public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578
xpktbllookupclubmember_mem | | RowExclusiveLock | t | autovacuum: VACUUM public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578
tbllookupclubmember_pkey | | RowExclusiveLock | t | autovacuum: VACUUM public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578
| | ExclusiveLock | t | autovacuum: VACUUM public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578
tbllookupclubmember | | ShareUpdateExclusiveLock | t | autovacuum: VACUUM public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578

gdb bt for pid 14578:

Program received signal SIGINT, Interrupt.
0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6
(gdb) bt
#0 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6
#1 0x000000000077aa9a in pg_usleep ()
#2 0x00000000004b66af in GetMultiXactIdMembers ()
#3 0x0000000000487b93 in heap_prepare_freeze_tuple ()
#4 0x0000000000585e07 in lazy_vacuum_rel ()
#5 0x0000000000583a75 in ?? ()
#6 0x0000000000583e37 in vacuum ()
#7 0x0000000000616a06 in ?? ()
#8 0x00000000006171db in ?? ()
#9 0x0000000000617289 in StartAutoVacWorker ()
#10 0x0000000000624cf2 in ?? ()
#11 <signal handler called>
#12 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6
#13 0x0000000000621f76 in ?? ()
#14 0x000000000062619a in PostmasterMain ()
#15 0x00000000005bd52d in main ()
(gdb) cont
Continuing.

A couple other things I saw you request in other threads:

# select oid::regclass, relfrozenxid, relminmxid from pg_class where relname = 'tbllookupclubmember';
oid | relfrozenxid | relminmxid
---------------------+--------------+------------
tbllookupclubmember | 22915818 | 616
(1 row)

# select datname, datfrozenxid, datminmxid from pg_database where datname = 'prod';
datname | datfrozenxid | datminmxid
--------------------+--------------+------------
prod | 654 | 1

# ls -al ~postgres/9.4/data/pg_multixact/offsets/
total 368
drwx------ 2 postgres postgres 4096 Mar 5 2015 .
drwx------ 4 postgres postgres 4096 Mar 5 2015 ..
-rw------- 1 postgres postgres 262144 Mar 5 2015 0000
-rw------- 1 postgres postgres 81920 Feb 18 09:47 0001

In this thread: http://postgresql.nabble.com/BUG-11264-Auto-vacuum-wraparound-job-blocking-everything-td5816237.html, I saw you recommend this fix:

> > How do we fix the current issue with this one server? Is there an easy
> >fix?
> > Thanks.
>
> As far as I am aware, you should
> UPDATE pg_database SET datminmxid=20783
>
> and that should fix it. The oldestMulti value in pg_control would get
> updated by itself some time later. If you experience stalls before
> oldestMulti fixes itself, you could stop the server (cleanly!) and then
> pg_resetxlog -m x,y where x is the correct nextMulti value from
> pg_controldata and y is 20783.

Is that something I should/could do here? I’ve temporarily disabled autovacuum on the offending table until we can figure it out.

Thank you!

Brian

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Jim Nasby 2016-02-18 20:47:17 Re: BUG #13964: unexpected result from to_tsvector
Previous Message Jeff Janes 2016-02-18 17:17:15 Re: BUG #13971: SysV shm is used regardless of config