Re: backends stuck in "startup"

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: backends stuck in "startup"
Date: 2017-11-22 06:47:02
Message-ID: 20171122064702.GQ5668@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
> Hi,
>
> On 2017-11-21 17:09:26 -0600, Justin Pryzby wrote:
> > I'm sorry to report this previously reported problem is happening again,
> > starting shortly after pg_upgrading a customer to PG10.1 from 9.6.5.
> >
> > As $subject: backends are stuck in startup for minutes at a time. I didn't
> > strace this time, but I believe last time I saw one was waiting in a futex.
>
> Could you try stracing next time?

I straced all the "startup" PIDs, which were all in futex, without exception:

[pryzbyj(at)TS-DB ~]$ sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/startup$/{print "-p"$1}'` 2>&1 |tee -a /tmp/strace-pg-backends

Process 3611 attached
Process 6081 attached
[pid 6081] futex(0x7ff4314000f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 3611] futex(0x7ff431400138, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 3354] futex(0x7ff431400258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 3117] futex(0x7ff4314002b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2948] futex(0x7ff431400378, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2925] futex(0x7ff4314003d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2789] futex(0x7ff431400438, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2752] futex(0x7ff4314004b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2663] futex(0x7ff431400618, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2661] futex(0x7ff431400638, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2624] futex(0x7ff431400678, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 2576] futex(0x7ff4314006f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 365] futex(0x7ff431400718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 326] futex(0x7ff431400798, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32704] futex(0x7ff431400898, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32309] futex(0x7ff431400938, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32292] futex(0x7ff431400a18, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32170] futex(0x7ff431400ad8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 32134] futex(0x7ff431400b58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31987] futex(0x7ff431400bf8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31962] futex(0x7ff431400c58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31817] futex(0x7ff431400d38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31807] futex(0x7ff431400d98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29664] futex(0x7ff431400e98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29462] futex(0x7ff431400ef8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29426] futex(0x7ff431400f38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29258] futex(0x7ff431401058, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29227] futex(0x7ff431402f58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29107] futex(0x7ff431401138, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 29071] futex(0x7ff431401198, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 28942] futex(0x7ff431401258, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 28924] futex(0x7ff431401298, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26942] futex(0x7ff431401418, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 25278] futex(0x7ff431401438, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24909] futex(0x7ff431401598, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24908] futex(0x7ff4314015b8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24852] futex(0x7ff4314015f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24851] futex(0x7ff431401618, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24585] futex(0x7ff4314016f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24584] futex(0x7ff431401718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24478] futex(0x7ff431401918, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24476] futex(0x7ff431401958, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24470] futex(0x7ff4314019d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24464] futex(0x7ff431401ad8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24458] futex(0x7ff431401bb8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24456] futex(0x7ff431401b98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24450] futex(0x7ff431401c58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24448] futex(0x7ff431401c98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24439] futex(0x7ff431401cf8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24323] futex(0x7ff431401e38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 24322] futex(0x7ff431401e58, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23057] futex(0x7ff431401fd8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23055] futex(0x7ff431402038, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21826] futex(0x7ff4314021f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21824] futex(0x7ff431402238, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21794] futex(0x7ff431402298, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21792] futex(0x7ff431402638, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21791] futex(0x7ff431402af8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21790] futex(0x7ff431402c98, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21786] futex(0x7ff431402478, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21785] futex(0x7ff431402758, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21784] futex(0x7ff431403098, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21768] futex(0x7ff431402718, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21662] futex(0x7ff431402338, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21648] futex(0x7ff431402c38, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21630] futex(0x7ff4314024d8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21591] futex(0x7ff431402278, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21590] futex(0x7ff431402bf8, FUTEX_WAIT, 0, NULL

Also:

[pryzbyj(at)TS-DB ~]$ date;ps -O lstart,wchan=9999999999999999999999 -u postgres --sort start_time |grep startup$
Wed Nov 22 00:47:41 EST 2017
1990 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53934) startup
2007 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53945) startup
2014 Wed Nov 22 00:36:44 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53946) startup
2042 Wed Nov 22 00:36:48 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51353) startup
2043 Wed Nov 22 00:36:48 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.252(52122) startup
2047 Wed Nov 22 00:36:49 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46696) startup
2049 Wed Nov 22 00:36:49 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.249(60135) startup
2052 Wed Nov 22 00:36:50 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53947) startup
2095 Wed Nov 22 00:37:13 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51355) startup
2096 Wed Nov 22 00:37:14 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.249(60137) startup
2097 Wed Nov 22 00:37:15 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53949) startup
2178 Wed Nov 22 00:37:43 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46698) startup
2188 Wed Nov 22 00:38:00 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.252(52124) startup
2331 Wed Nov 22 00:38:30 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51357) startup
2332 Wed Nov 22 00:38:30 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.249(60139) startup
2333 Wed Nov 22 00:38:31 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53951) startup
2529 Wed Nov 22 00:39:00 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46700) startup
2530 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.248(51358) startup
2533 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.247(53952) startup
2535 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.252(52125) startup
2549 Wed Nov 22 00:39:01 2017 futex_wait_queue_me S ? 00:00:00 postgres: xx xx xx.253(46701) startup
[...]

> Don't think it's conclusive that they're all continually waiting rather than
> contending on a few locks. As soon as you attach to one backend a contended
> lock situation can quickly turn into standstill if you end up interrupting a
> process that holds a lock.

I didn't understand what you mean.

By "attach" do you mean attach with strace? gdb? Or ??

I also tried stracing all (400) postgres PIDs, which didn't magically unlock them..
sudo strace `ps h --sort start_time -o pid,args -u postgres |awk '/|startup$/{print "-p"$1}'` 2>> /tmp/strace-pg-backends3 &

FYI these AREN'T being rebuilt very often:

[pryzbyj(at)TS-DB ~]$ sudo sh -c "find /var/lib/pgsql/ -name 'pg_internal.init' -print0 |xargs -r0 ls -lt"
-rw------- 1 postgres postgres 136164 Nov 21 21:50 /var/lib/pgsql/10/data/base/16404/pg_internal.init
-rw------- 1 postgres postgres 21340 Nov 21 21:50 /var/lib/pgsql/10/data/global/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:47 /var/lib/pgsql/10/data/base/16406/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:32 /var/lib/pgsql/10/data/base/16410/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:28 /var/lib/pgsql/10/data/base/16401/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:21 /var/lib/pgsql/10/data/base/16403/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/13451/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 21:13 /var/lib/pgsql/10/data/base/16408/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:59 /var/lib/pgsql/10/data/base/16411/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:56 /var/lib/pgsql/10/data/base/16402/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/1/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:01 /var/lib/pgsql/10/data/base/16409/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16407/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 20:00 /var/lib/pgsql/10/data/base/16405/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 19:59 /var/lib/pgsql/10/data/base/16400/pg_internal.init
-rw------- 1 postgres postgres 136164 Nov 21 17:17 /var/lib/pgsql/10/data/base/13450/pg_internal.init
[pryzbyj(at)TS-DB ~]$ date
Wed Nov 22 00:45:14 EST 2017

And it's not a minor overhead:

[pryzbyj(at)TS-DB ~]$ time psql newn -c 'SELECT 1'
psql: FATAL: terminating connection due to administrator command
real 10m22.089s

Justin

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message legrand legrand 2017-11-22 08:01:34 Re: One Partition by list is always chosen by planner
Previous Message Tom Lane 2017-11-22 05:37:07 Re: On Judging the Value of Tests