Re: backends stuck in "startup"

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-general(at)postgresql(dot)org
Subject: Re: backends stuck in "startup"
Date: 2017-11-23 01:44:58
Message-ID: 16822.1511401498@sss.pgh.pa.us
Lists: pgsql-general

Justin Pryzby <pryzby(at)telsasoft(dot)com> writes:
> On Wed, Nov 22, 2017 at 07:43:50PM -0500, Tom Lane wrote:
>> My hypothesis about a missed memory barrier would imply that there's (at
>> least) one process that's waiting but is not in the lock's wait queue and

> Do I have to also check the wait queue to verify? Give a hint/pointer please?

Andres probably knows more about this data structure than I do, but I
believe that the values in the LWLock's proclist_head field are indexes
into the PGPROC array, and that the PGPROC.lwWaitLink proclist_node fields
contain the fore and aft pointers in a doubly-linked list of waiting
processes. But chasing through that by hand is going to be darn tedious
if there are a bunch of processes queued for the same lock. In any case,
if the process is blocked right there and its lwWaiting field is not set,
that is sufficient proof of a bug IMO. What is not quite proven yet is
why it failed to detect that it'd been woken.
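
To make the by-hand check less tedious, the structure described above can be modeled in a few lines. This is only a sketch: the names `head`, `next_links`, and `INVALID` are illustrative stand-ins, the sentinel value is an assumption, and the input is whatever index values have been copied out of a debugger by hand, not anything read from a live server.

```python
# Illustrative model of an index-linked wait queue; not PostgreSQL source code.
# ASSUMPTION: "no entry" is marked by a sentinel index (the value here is a guess).
INVALID = 0x7FFFFFFF


def walk_wait_queue(head, next_links):
    """Return the process indexes reachable from `head` by following next links.

    head       -- index of the first waiter, or INVALID if the queue is empty
    next_links -- dict mapping a process index to the index of the next waiter
    """
    seen = []
    visited = set()
    current = head
    # Stop at the sentinel, or if a link loops back to an index already seen.
    while current != INVALID and current not in visited:
        visited.add(current)
        seen.append(current)
        # A missing entry is treated as the end of the queue.
        current = next_links.get(current, INVALID)
    return seen


def is_queued(procno, head, next_links):
    """True if process index `procno` appears in the wait queue."""
    return procno in walk_wait_queue(head, next_links)
```

A process that is blocked on its semaphore but for which `is_queued` returns False would be consistent with the hypothesis quoted above of a waiter that is not in the lock's wait queue.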

I think really the most useful thing at this point is just to wait and
see if your SYSV-semaphore build exhibits the same problem or not.
If it does not, we can be pretty confident that *something* is wrong
with the POSIX-semaphore code, even if my current theory isn't it.

>> My theory suggests that any contended use of an LWLock is at risk,
>> in which case just running pgbench with about as many sessions as
>> you have in the live server ought to be able to trigger it. However,
>> that doesn't really account for your having observed the problem
>> only during session startup,

> Remember, this issue breaks existing sessions, too.

Well, once one session is hung up, anything else that came along wanting
access to that same LWLock would also get stuck. Since the lock in
question is a buffer partition lock controlling access to something like
1/128th of the shared buffer pool, it would not take too long for every
active session to get stuck there, whether it were doing anything related
or not.
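
A rough back-of-the-envelope version of that argument, as a sketch only. It assumes each buffer lookup lands on one of 128 partitions uniformly and independently, which real workloads will not do exactly; the function name is made up for illustration.

```python
# ASSUMPTION: each buffer lookup lands on one of the partitions uniformly at
# random and independently; real access patterns are skewed.
NUM_PARTITIONS = 128  # "something like 1/128th" in the text above


def chance_of_hitting_stuck_partition(lookups, partitions=NUM_PARTITIONS):
    """Probability that at least one of `lookups` accesses needs the stuck lock."""
    return 1.0 - (1.0 - 1.0 / partitions) ** lookups
```

Under those assumptions a session has better than even odds of touching the stuck partition within roughly a hundred buffer lookups, which is why unrelated sessions would pile up quickly.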

In any case, if you feel like trying the pgbench approach, I'd suggest
setting up a script to run a lot of relatively short runs rather than one
long one. If there is something magic about the first blockage in a
session, that would help catch it.
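
One possible shape for such a script, as a minimal sketch. The function names, the `grace` timeout, and the example values are all hypothetical; only the pgbench options `-n`, `-c`, and `-T` are taken from pgbench itself. A run that does not finish in time is left running rather than killed, on the assumption that the stuck processes are wanted for inspection.

```python
import subprocess


def build_pgbench_cmd(clients, seconds, dbname):
    """Build the argument list for one short pgbench run."""
    return [
        "pgbench",
        "-n",                  # skip the vacuum step before the run
        "-c", str(clients),    # concurrent client sessions
        "-T", str(seconds),    # run duration in seconds
        dbname,
    ]


def run_short_runs(runs, clients, seconds, dbname, grace=60):
    """Start `runs` short pgbench runs one after another.

    Returns the index of the first run that fails to finish within
    `seconds + grace`, leaving it running for inspection, or None if
    every run finishes.
    """
    cmd = build_pgbench_cmd(clients, seconds, dbname)
    for i in range(runs):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=seconds + grace)
        except subprocess.TimeoutExpired:
            return i  # possible hang: not killed, so backends can be examined
    return None


# Example with hypothetical values; the database must already have been
# initialized with "pgbench -i":
# run_short_runs(runs=500, clients=200, seconds=10, dbname="bench")
```

Since every run opens fresh sessions, many short runs exercise session startup far more often than one long run does.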

> Am I right this won't help for lwlocks? ALTER SYSTEM SET log_lock_waits=yes

Nope, that's just for heavyweight locks. LWLocks are lightweight
precisely because they don't have stuff like logging, timeouts,
or deadlock detection ...

regards, tom lane
