From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |
Date: | 2023-01-29 17:41:10 |
Message-ID: | 20230129174110.dc6eh7l6h7abakri@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2023-01-30 06:26:02 +1300, Thomas Munro wrote:
> On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> > So I did that - same configure options as the buildfarm client, and a
> > 'make check' (with only tests up to the 'join' suite, because that's
> > where it got stuck before). And it took only ~15 runs (~1h) to hit this
> > again on dikkop.
>
> That's good news.
Indeed.
As annoying as it is, it might be worth reproing it once or twice more, just
to have a feeling for how long we need to run to have confidence in a fix.
> I was talking to Andres on IM about this yesterday and he pointed out
> a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
> tell the signal handler to write to the self-pipe) and then reads
> latch->is_set with neither compiler nor memory barrier, which doesn't
> seem right because we might see a value of latch->is_set from before
> "waiting" was true, and yet the signal handler might also have run
> while "waiting" was false so the self-pipe doesn't save us, despite
> the length of the comment about that. Can you reproduce it with this
> change?
>
> --- a/src/backend/storage/ipc/latch.c
> +++ b/src/backend/storage/ipc/latch.c
> @@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
> * ordering, so that we cannot miss seeing is_set if a notificat
> ion
> * has already been queued.
> */
> + pg_memory_barrier();
> if (set->latch && set->latch->is_set)
> {
> occurred_events->fd = PGINVALID_SOCKET;
I think we need a barrier in SetLatch(), after is_set = true. We have that in
some of the newer branches (due to the maybe_sleeping logic), but not in the
older branches.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2023-01-29 17:53:36 | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |
Previous Message | Tomas Vondra | 2023-01-29 17:39:05 | Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) |