backends stuck in "startup"

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: backends stuck in "startup"
Date: 2017-11-21 23:09:26
Message-ID: 20171121230926.GA17194@telsasoft.com
Lists: pgsql-general

I'm sorry to report that this previously-reported problem is happening again,
starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.

As $subject: backends are stuck in "startup" for minutes at a time. I didn't
strace them this time, but I believe that last time I looked, one was waiting on a futex.

postgres 25271 17665 0 16:54 ? 00:00:00 postgres: oooo oooo a.b.c.246(59980) startup
postgres 27123 17665 0 16:55 ? 00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59919) startup
postgres 27142 17665 0 16:55 ? 00:00:00 postgres: main main a.b.c.100(42892) startup
postgres 27224 17665 0 16:56 ? 00:00:00 postgres: main main a.b.c.100(42898) startup
postgres 27270 17665 0 16:56 ? 00:00:00 postgres: wwwwww wwwwww a.b.c.252(39760) startup
postgres 27271 17665 0 16:56 ? 00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37224) startup
postgres 27873 17665 0 16:58 ? 00:00:00 postgres: wwwwwwww wwwwwwww a.b.c.253(59667) startup
postgres 30038 17665 0 17:00 ? 00:00:00 postgres: wwwwww wwwwww a.b.c.252(39780) startup
postgres 30039 17665 0 17:00 ? 00:00:00 postgres: oooo oooo a.b.c.246(60014) startup
postgres 30040 17665 0 17:00 ? 00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59951) startup
postgres 30041 17665 0 17:00 ? 00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59952) startup
postgres 30042 17665 0 17:00 ? 00:00:00 postgres: wwwwww wwwwww a.b.c.252(39788) startup
postgres 30043 17665 0 17:00 ? 00:00:00 postgres: aaaaa aaaaa a.b.c.245(44819) startup
postgres 30044 17665 0 17:00 ? 00:00:00 postgres: nnn nnn a.b.c.250(41317) startup
postgres 30045 17665 0 17:00 ? 00:00:00 postgres: mmmmmm mmmmmm a.b.c.249(47876) startup
postgres 30046 17665 0 17:00 ? 00:00:00 postgres: nnn nnn a.b.c.250(41320) startup
postgres 30048 17665 0 17:00 ? 00:00:00 postgres: aaaaa aaaaa a.b.c.245(44834) startup
postgres 30051 17665 0 17:00 ? 00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37244) startup
postgres 30052 17665 0 17:00 ? 00:00:00 postgres: iiiiiii iiiiiii a.b.c.248(37247) startup
postgres 30054 17665 0 17:00 ? 00:00:00 postgres: nnn nnn a.b.c.250(41327) startup
postgres 30079 17665 0 17:00 ? 00:00:00 postgres: aaaaa aaaaa a.b.c.245(44843) startup
postgres 30081 17665 0 17:00 ? 00:00:00 postgres: wwwwww wwwwww a.b.c.252(39790) startup
postgres 30083 17665 0 17:00 ? 00:00:00 postgres: pppp pppp a.b.c.251(50397) startup
postgres 30084 17665 0 17:00 ? 00:00:00 postgres: main main a.b.c.100(43074) startup
postgres 30604 17665 0 17:02 ? 00:00:00 postgres: oooo oooo a.b.c.246(60018) startup
postgres 30614 17665 0 17:02 ? 00:00:00 postgres: nnn nnn a.b.c.250(41331) startup
postgres 30624 17665 0 17:02 ? 00:00:00 postgres: aaaaa aaaaa a.b.c.245(44847) startup
postgres 30841 17665 0 17:03 ? 00:00:00 postgres: mmmmmmmm mmmmmmmm a.b.c.247(59961) startup
postgres 31023 17665 0 17:03 ? 00:00:00 postgres: postgres main [local] startup
postgres 31024 17665 0 17:03 ? 00:00:00 postgres: wwwwww wwwwww a.b.c.252(39796) startup

I found these older, similar reports:
https://www.postgresql.org/message-id/flat/4F9AC452(dot)3050102%40pgexperts(dot)com#4F9AC452(dot)3050102(at)pgexperts(dot)com
https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us

I saved ~40 core files from backends during the most recent incident; their
backtraces are all essentially identical:

(gdb) bt
#0 0x0000003f6ee0d930 in ?? ()
#1 0x00007f19f3401258 in ?? ()
#2 0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3401258) at pg_sema.c:310
#3 0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
#4 0x00000000006f75e5 in BufferAlloc (smgr=0x1f3bc48, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0,
hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5 ReadBuffer_common (smgr=0x1f3bc48, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "")
at bufmgr.c:745
#6 0x00000000006f800e in ReadBufferExtended (reln=0x1ee6a08, forkNum=MAIN_FORKNUM, blockNum=1, mode=<value optimized out>,
strategy=<value optimized out>) at bufmgr.c:664
#7 0x00000000004c090a in _bt_relandgetbuf (rel=0x1ee6a08, obuf=145, blkno=<value optimized out>, access=1) at nbtpage.c:708
#8 0x00000000004c1284 in _bt_getroot (rel=0x1ee6a08, access=<value optimized out>) at nbtpage.c:298
#9 0x00000000004c46eb in _bt_search (rel=0x1ee6a08, keysz=1, scankey=0x7fff6ef1d040, nextkey=0 '\000', bufP=0x7fff6ef1da4c, access=1, snapshot=0x1f1a648)
at nbtsearch.c:103
#10 0x00000000004c5d1d in _bt_first (scan=0x1f1a4a8, dir=ForwardScanDirection) at nbtsearch.c:1028
#11 0x00000000004c39dd in btgettuple (scan=0x1f1a4a8, dir=ForwardScanDirection) at nbtree.c:372
#12 0x00000000004bcf1a in index_getnext_tid (scan=0x1f1a4a8, direction=<value optimized out>) at indexam.c:541
#13 0x00000000004bcfca in index_getnext (scan=0x1f1a4a8, direction=ForwardScanDirection) at indexam.c:679
#14 0x00000000004bc549 in systable_getnext (sysscan=0x1f1ae68) at genam.c:422
#15 0x000000000080fac7 in SearchCatCache (cache=0x1ece848, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>,
v4=<value optimized out>) at catcache.c:1178
#16 0x0000000000844e7d in superuser_arg (roleid=33320) at superuser.c:71
#17 0x0000000000832917 in InitPostgres (in_dbname=0x1e9eec8 "iiiiiii", dboid=0, username=0x1e71738 "iiiiiii", useroid=0, out_dbname=0x0) at postinit.c:742
#18 0x00000000007239f2 in PostgresMain (argc=1, argv=<value optimized out>, dbname=0x1e9eec8 "iiiiiii", username=0x1e71738 "iiiiiii") at postgres.c:3735
#19 0x00000000006b7d7a in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4357
#20 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4029
#21 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1753
#22 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1361
#23 0x0000000000638ad0 in main (argc=3, argv=0x1e6f5b0) at main.c:228

(gdb) p *reln
$2 = {rd_node = {spcNode = 1663, dbNode = 16405, relNode = 2679}...

nep=# SELECT relname FROM pg_class WHERE oid=2679;
relname | pg_index_indexrelid_index
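
(For reference, I think the same filenode-to-relation mapping could also be done
with pg_filenode_relation(), which should also cope with relations that have been
rewritten; something like this, run while connected to the affected database:

  SELECT pg_filenode_relation(0, 2679);  -- 0 means the database's default tablespace

where 2679 is rd_node.relNode from the gdb output above.)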

For the largest DB:
main=# \dt+ pg_class|pg_attribute|pg_index
pg_catalog | pg_attribute | table | postgres | 227 MB |
pg_catalog | pg_class | table | postgres | 6752 kB |
pg_catalog | pg_index | table | postgres | 3448 kB |
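
Those sizes are from \dt+; I believe roughly the same numbers could be pulled
with something like:

  SELECT relname, pg_size_pretty(pg_table_size(oid::regclass))
  FROM pg_class
  WHERE relname IN ('pg_class', 'pg_attribute', 'pg_index');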

During normal operation, after restarting following this incident:

postgres=# SELECT COUNT(1) FROM pg_stat_activity;
count | 85

shared_buffers | 2GB
max_connections | 400
effective_cache_size | 4GB
maintenance_work_mem | 512MB
effective_io_concurrency | 2
max_locks_per_transaction | 64
max_parallel_workers_per_gather | 3
password_encryption | scram-sha-256
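
Those are excerpted; if the full list of non-default settings would help, I can
pull it with something like:

  SELECT name, setting, source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');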

I can probably collect other diagnostics if needed, but otherwise it's possible
I'll need to reload this DB under PG96.

..but, while this issue is occurring, I can't run simple queries
(like SELECT * FROM pg_locks).
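
If I can still get a superuser session in during the next incident, I'll try to
grab wait events instead, something like this (assuming the stuck backends show
up in pg_stat_activity at all):

  SELECT pid, datname, usename, state, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE wait_event IS NOT NULL;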

Notes:
- we DO use temp tables; it looks like we created up to 90 temp tables each hour;
- we're still running analyze following yesterday's pg_upgrade;
- I have a script saving something from pg_buffercache, but it doesn't appear
to have been running during this last incident;
- I'm planning to disable synchronized seqscans for the night (although as I
understand it, that may not matter (?)); see the sketch after this list;
- this instance has clients connecting to ~10 DBs;
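
For the seqscan item, what I have in mind is roughly this (synchronize_seqscans
is the GUC; I believe it only needs a reload, not a restart):

  ALTER SYSTEM SET synchronize_seqscans = off;
  SELECT pg_reload_conf();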

Thanks in advance
Justin
