From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-general(at)postgresql(dot)org
Subject: Re: backends stuck in "startup"
Date: 2017-11-23 01:24:50
Message-ID: 20171123012450.GC5668@telsasoft.com
Lists: pgsql-general
On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby(at)telsasoft(dot)com> writes:
> > For starters, I found that PID 27427 has:
>
> > (gdb) p proc->lwWaiting
> > $1 = 0 '\000'
> > (gdb) p proc->lwWaitMode
> > $2 = 1 '\001'
>
> To confirm, this is LWLockAcquire's "proc", equal to MyProc?
> If so, and if LWLockAcquire is blocked at PGSemaphoreLock,
> that sure seems like a smoking gun.
Right:
(gdb) bt
#0 0x0000003f6ee0d930 in ?? ()
#1 0x00007f19f3400d38 in ?? ()
#2 0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
#3 0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
#4 0x00000000006f75e5 in BufferAlloc (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:1012
#5 ReadBuffer_common (smgr=0x1f3c368, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=1, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff6ef1c5af "") at bufmgr.c:745
[...]
(gdb) up
#1 0x00007f19f3400d38 in ?? ()
(gdb) up
#2 0x00000000006a4440 in PGSemaphoreLock (sema=0x7f19f3400d38) at pg_sema.c:310
310 errStatus = sem_wait(PG_SEM_REF(sema));
(gdb) up
#3 0x0000000000714495 in LWLockAcquire (lock=0x7f19f3404f80, mode=LW_SHARED) at lwlock.c:1233
1233 PGSemaphoreLock(proc->sem);
(gdb) p proc->lwWaiting
$1 = 0 '\000'
(gdb) p proc->lwWaitMode
$2 = 1 '\001'
(gdb) p proc
$3 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc
$4 = (PGPROC *) 0x7f1a77dba500
(gdb) p MyProc==proc
$5 = 1
I suppose one also needs to show that this backend was stuck, and didn't "just
happen to start 1ms before gdb connected", which I can show:
postgres 27427 17665 0 16:57 ? 00:00:00 postgres: indigo indigo 207.241.169.248(37226) startup
...
postgres 30298 17665 0 17:01 ? 00:00:00 postgres: main main 207.241.169.100(43088) startup
..so that process was still in "startup" 3-4 minutes after being forked.
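For the record, something like this (untested; GNU ps) would show the elapsed
time directly rather than comparing against the fork time:

ps -C postgres -o etime=,pid=,args= | grep startup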
> My hypothesis about a missed memory barrier would imply that there's (at
> least) one process that's waiting but is not in the lock's wait queue and
Do I also have to check the lock's wait queue to verify that? Could you give a hint/pointer, please?
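I suppose something like this would dump it?  (a rough, untested sketch; as I
read lwlock.h/proclist_types.h, lock->waiters holds pgprocno indexes into
ProcGlobal->allProcs, and <N> is a placeholder for whatever the previous
command printed):

(gdb) frame 3
(gdb) p lock->waiters
(gdb) p ProcGlobal->allProcs[<N>].pid
(gdb) p ProcGlobal->allProcs[<N>].lwWaiting
(gdb) p ProcGlobal->allProcs[<N>].lwWaitLink.next

..following .lwWaitLink.next from waiters.head towards waiters.tail and
checking whether this backend's MyProc->pgprocno ever shows up.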
> > Would you suggest how I can maximize the likelihood/speed of triggering that?
> > Five years ago, with a report of similar symptoms, you said "You need to hack
> > pgbench to suppress the single initialization connection it normally likes to
> > make, else the test degenerates to the one-incoming-connection case"
> > https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us
>
> I don't think that case was related at all.
>
> My theory suggests that any contended use of an LWLock is at risk,
> in which case just running pgbench with about as many sessions as
> you have in the live server ought to be able to trigger it. However,
> that doesn't really account for your having observed the problem
> only during session startup,
Remember, this issue breaks existing sessions, too.
After it'd happened a couple of times, I made sure to leave a session open to
allow collecting diagnostics (and to try to un-stick it), but when it recurred,
I was unable to even SELECT * FROM pg_locks.
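Meanwhile I'll try to reproduce it with something along these lines (only a
guess at parameters: -c would need to roughly match the live server's session
count, -C opens a new connection per transaction to also exercise the startup
path, and it assumes a prior pgbench -i):

pgbench -n -S -C -c 200 -j 8 -T 3600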
Am I right that this won't help for LWLocks? ALTER SYSTEM SET log_lock_waits=yes
Justin