Re: Missed condition-variable wakeups on FreeBSD

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Missed condition-variable wakeups on FreeBSD
Date: 2022-02-26 20:44:31
Message-ID: 20220226204431.kycjr4dd5wezmpt3@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-02-26 14:07:05 -0500, Tom Lane wrote:
> About once a month over the last six months, my buildfarm animal
> florican has gotten stuck while running the core regression tests.
> The symptoms have looked very much the same each time: there is
> a backend with two parallel worker processes that are just sitting
> and not consuming any CPU time. Each time I've attached to these
> processes with gdb to check their stack traces, and seen pretty
> much the same story every time (traces below). What is really
> interesting is that after I detach from the second worker, the
> processes resume running and finish out the test successfully!
> I don't know much about how gdb interacts with kernel calls on
> FreeBSD, but I speculate that the poll(2) call returns with EINTR
> after gdb releases the process, and then things resume fine,
> suggesting that we lost an interrupt somewhere.
>
> I have observed this three times in the REL_11 branch, once
> in REL_12, and a couple of times last summer before it occurred
> to me to start keeping notes. Over that time the machine has
> been running various patchlevels of FreeBSD 13.0.

It's certainly interesting that it appears to happen only in the branches using
poll rather than kqueue to implement latches. That changed between 12 and 13,
when the latch code on FreeBSD switched to kqueue.

Have you tried running the core regression tests with force_parallel_mode =
on, or with the parallel costs lowered, to see if that makes the problem
appear more often?
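
For example, something along these lines in the test cluster's postgresql.conf
(a rough sketch -- the exact values don't matter much, the point is to make
parallel plans nearly free for the planner):

  force_parallel_mode = on            # wrap eligible plans in a Gather
  parallel_setup_cost = 0             # make parallel plans look cheap
  parallel_tuple_cost = 0
  min_parallel_table_scan_size = 0    # allow parallelism on the small regression tables
  max_parallel_workers_per_gather = 4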

> Here's the stack trace from the leader process in the most
> recent event (REL_11 as of yesterday). It's not always the
> same query that gets stuck, but it's always a parallel hash join:

Which does make me wonder if it's a latch issue or a logic issue in hashjoin /
barriers. HJ is the only user of barrier.c...

There's this commit, which was subsequently reverted due to some issue and
hasn't been re-applied since...

commit 4e0f0995e923948631c4114ab353b256b51b58ad
Author: Thomas Munro <tmunro(at)postgresql(dot)org>
Date: 2021-03-17 17:46:39 +1300

Fix race in Parallel Hash Join batch cleanup.

It doesn't seem super likely to be related, but...

> (gdb) p debug_query_string
> $1 = 0x21873090 "select count(*) from simple r join simple s using (id);"
> (gdb) bt
> #0 _poll () at _poll.S:4
> #1 0x21701361 in __thr_poll (fds=0x219dc170, nfds=2, timeout=-1) at /usr/src/lib/libthr/thread/thr_syscalls.c:338
> #2 0x215eaf3f in poll (pfd=0x219dc170, nfds=2, timeout=-1) at /usr/src/lib/libc/sys/poll.c:47
> #3 0x0097b0fd in WaitEventSetWaitBlock (set=<optimized out>, cur_timeout=-1, occurred_events=<optimized out>, nevents=<optimized out>) at latch.c:1171

The next time this happens (or if you still have this one open), perhaps it'd
be worth checking whether there's a byte sitting in the self-pipe?
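
Something like the following from gdb, while still attached, ought to tell --
a rough sketch, assuming selfpipe_readfd is still the static in latch.c in
those branches, and that 0x4004667f is FreeBSD's FIONREAD:

  (gdb) print selfpipe_readfd
  (gdb) set $nbytes = (int *) malloc(sizeof(int))
  (gdb) call (int) ioctl(selfpipe_readfd, 0x4004667f, $nbytes)
  (gdb) print *$nbytes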

> Thoughts? Ideas on debugging this?

Besides trying to make the issue more likely, as suggested above, it might be
worth checking whether signalling the stuck processes with SIGUSR1 gets them
unstuck.
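
I.e. something along these lines, with the pids of the stuck leader and
workers filled in -- SIGUSR1 is the signal the latch machinery uses in these
branches, so it should set the latch and write to the self-pipe:

  kill -USR1 <leader-pid> <worker1-pid> <worker2-pid>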

Greetings,

Andres Freund
