Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date:	2023-01-29 17:39:05
Message-ID:	4dcd8d2b-efd6-4ede-1c43-f2dbd760ea3e@enterprisedb.com
Lists:	pgsql-hackers

On 1/29/23 18:26, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> So I did that - same configure options as the buildfarm client, and a
>> 'make check' (with only tests up to the 'join' suite, because that's
>> where it got stuck before). And it took only ~15 runs (~1h) to hit this
>> again on dikkop.
>
> That's good news.
>
>> I managed to collect the fstat/procstat stuff Thomas asked for, and the
>> backtraces - attached. I still have the core files, in case we need to look at
>> something. As before, running gcore on the second worker (29081) gets
>> this unstuck - it sends some signal that apparently wakes it up.
>
> Thanks! As expected, no bytes in the pipe for any of those processes.
> Unfortunately I gave the wrong procstat command, it should be -i, not
> -j. Does "procstat -i /path/to/core | grep USR1" show P (pending) for
> that stuck process? Silly question really, I don't really expect
> poll() to be misbehaving in such a basic way.
>

It shows "--C" for all three processes, which should mean "will be
caught", i.e. no P (pending) flag for any of them.

> I was talking to Andres on IM about this yesterday and he pointed out
> a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
> tell the signal handler to write to the self-pipe) and then reads
> latch->is_set with neither compiler nor memory barrier, which doesn't
> seem right because we might see a value of latch->is_set from before
> "waiting" was true, and yet the signal handler might also have run
> while "waiting" was false so the self-pipe doesn't save us, despite
> the length of the comment about that. Can you reproduce it with this
> change?
>

Will do, but I'll wait for another lockup first to see how frequent it
actually is. I'm now at ~90 runs total, and it hasn't happened again yet.
So hitting it after 15 runs might have been a bit of luck.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) at 2023-01-29 17:26:02 from Thomas Munro

Responses

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) at 2023-01-29 17:53:36 from Andres Freund
