Re: backends stuck in "startup"

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-general(at)postgresql(dot)org
Subject: Re: backends stuck in "startup"
Date: 2017-11-23 01:24:50
Message-ID: 20171123012450.GC5668@telsasoft.com
Lists: pgsql-general

On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby(at)telsasoft(dot)com> writes:
> > For starters, I found that PID 27427 has:
>
> > (gdb) p proc->lwWaiting
> > $1 = 0 '\000'
> > (gdb) p proc->lwWaitMode
> > $2 = 1 '\001'
>
> To confirm, this is LWLockAcquire's "proc", equal to MyProc?
> If so, and if LWLockAcquire is blocked at PGSemaphoreLock,
> that sure seems like a smoking gun.

Right:

(gdb) bt
#0 0x0000003f6ee0d930 in ?? ()
#1 0x00007f19f3400d38 in ?? ()
#2 0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
#3 0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
#4 0x00000000006f75e5 in BufferAlloc (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5 ReadBuffer_common (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:745
[...]

(gdb) up
#1 0x00007f19f3400d38 in ?? ()
(gdb) up
#2 0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
310 errStatus = sem_wait(PG_SEM_REF(sema));
(gdb) up
#3 0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
1233 PGSemaphoreLock(proc->sem);
(gdb) p proc->lwWaiting
$1 = 0 '\000'
(gdb) p proc->lwWaitMode
$2 = 1 '\001'
(gdb) p proc
$3 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc
$4 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc==proc
$5 = 1
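
In case it recurs, the same capture can be scripted rather than typed
interactively.  A rough sketch (assuming the postgres debuginfo is installed;
the PID and the LWLockAcquire frame number need adjusting to whatever the
stuck backend actually shows):

gdb -batch -p 27427 \
  -ex 'bt' \
  -ex 'frame 3' \
  -ex 'p proc->lwWaiting' \
  -ex 'p proc->lwWaitMode' \
  -ex 'p proc == MyProc'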

I suppose one needs to show that this backend was stuck and not "just
happened to start 1ms before gdb connected", which I actually have:
postgres 27427 17665 0 16:57 ? 00:00:00 postgres: indigo indigo 207.241.169.248(37226) startup
...
postgres 30298 17665 0 17:01 ? 00:00:00 postgres: main main 207.241.169.100(43088) startup

..so that process was in "startup" 3-4 minutes after being forked.
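
Something like this would catch them as they appear (untested sketch; the grep
may need adjusting to the local ps output):

while sleep 10; do
    date
    ps -C postgres -o pid,etime,args | grep ' startup'
done

The etime column makes it obvious when a backend has been sitting in "startup"
for minutes rather than milliseconds.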

> My hypothesis about a missed memory barrier would imply that there's (at
> least) one process that's waiting but is not in the lock's wait queue and

Do I also have to check the lock's wait queue to verify that?  Could you give a hint/pointer?
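
My guess, from reading lwlock.c and proclist.h, is something like the
following from the same gdb session -- untested sketch, and NNN below is just
whatever pgprocno shows up in the waiters head/tail fields, not a real value:

(gdb) frame 3
(gdb) p lock->waiters
(gdb) p MyProc->pgprocno
(gdb) p ProcGlobal->allProcs[NNN].pid
(gdb) p ProcGlobal->allProcs[NNN].lwWaitLink.next

i.e. walk the list from lock->waiters.head via lwWaitLink.next until it hits
INVALID_PGPROCNO, and see whether this backend's own pgprocno ever shows up.
But I may be misreading it.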

> > Would you suggest how I can maximize the likelihood/speed of triggering that?
> > Five years ago, with a report of similar symptoms, you said "You need to hack
> > pgbench to suppress the single initialization connection it normally likes to
> > make, else the test degenerates to the one-incoming-connection case"
> > https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us
>
> I don't think that case was related at all.
>
> My theory suggests that any contended use of an LWLock is at risk,
> in which case just running pgbench with about as many sessions as
> you have in the live server ought to be able to trigger it. However,
> that doesn't really account for your having observed the problem
> only during session startup,

Remember, this issue breaks existing sessions, too.

After it'd happened a couple of times, I made sure to leave a session open to
allow collecting diagnostics (and to try to un-stick it), but when it recurred,
I was unable to even SELECT * FROM pg_locks.

Am I right that this won't help for LWLocks?  ALTER SYSTEM SET log_lock_waits=yes

Justin
