From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure with barriers in parallel hash join
Date: 2021-03-05 20:56:36
Message-ID: CA+hUKGK32wJdQ9p4dQq1g8a+sL5mk5q4jUtkjrgGZozL945pOg@mail.gmail.com
Lists: pgsql-hackers
On Tue, Oct 13, 2020 at 12:18 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Tue, Oct 13, 2020 at 12:15 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > On Thu, Oct 1, 2020 at 8:08 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >> On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >> Here's a throw-away patch to add some sleeps that trigger the problem,
> >> and a first draft fix. I'll do some more testing of this next week
> >> and see if I can simplify it.
> >
> > I was just taking a look at the patch and noticed the commit message
> > says:
> >
> > > With unlucky timing and parallel_leader_participation off...
> >
> > Is parallel_leader_participation being off required to reproduce the
> > issue?
>
> Yeah, because otherwise the leader detaches last so the problem doesn't arise.
While working on Melanie's Parallel Hash Full Join patch, I remembered
that this (apparently extremely rare) race still needs fixing. Here is
a slightly tidied version of the fix, which I'm adding to the next CF
for CI coverage.
Here also is a picture from an unfinished description of this
algorithm that I've been trying to write, which might help explain the
change. It's a phase diagram showing the phases "run" (= all processes
try to work on batches) and "done" (= one process is freeing the shmem
objects used to track batches; anyone who attaches to the barrier in
this phase knows that it's not even safe to access the batch
bookkeeping memory). Before this patch there is no "run" phase, just
"done" (= process batches and then one process frees), which has a
race if someone else attaches really late, after the freeing has begun.
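To make that concrete, here's a rough sketch (my illustration, not the
patch itself; the helper function and the phase numbers are made-up
placeholders) of the check a late attacher can make using the phase
returned by BarrierAttach():

#include "postgres.h"
#include "storage/barrier.h"

/* Placeholder phase numbers; in the patch "run" sits just before "done". */
#define PHJ_BUILD_RUN	4
#define PHJ_BUILD_DONE	5

/*
 * Hypothetical helper: attach to the build barrier and report whether
 * the shared batch tracking memory may still be accessed.
 */
static bool
attach_if_batches_still_valid(Barrier *build_barrier)
{
	int		phase = BarrierAttach(build_barrier);

	if (phase >= PHJ_BUILD_DONE)
	{
		/*
		 * Some process has already begun freeing the shmem objects that
		 * track batches, so that memory must not be read.  Detach
		 * immediately; the caller simply has no work to contribute.
		 */
		BarrierDetach(build_barrier);
		return false;
	}

	/*
	 * We attached during "run" or earlier, so we are now counted as a
	 * participant and the batch tracking objects must stay valid until
	 * we detach: only the last process to detach does the freeing.
	 */
	return true;
}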
I'm currently wondering whether this can be further improved using
Melanie's new BarrierArriveAndDetachExceptLast() function.
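Just to sketch the idea (purely hypothetical, not proposed code;
free_batch_tracking() is a made-up placeholder for the real cleanup):
everyone arrives, all but the last arrival detach immediately, and the
one process left attached can free with no possibility of a concurrent
reader:

if (BarrierArriveAndDetachExceptLast(build_barrier))
{
	/*
	 * BarrierArriveAndDetachExceptLast() returned true, so every other
	 * participant has detached and we are the only one left attached.
	 */
	free_batch_tracking(hashtable);		/* placeholder */
	BarrierDetach(build_barrier);
}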
(In the code the phase names have -ing on the end; I'll probably drop
that, because commit 3048898e73c did the same to the corresponding
wait events.)
Attachments:
v2-0001-Inject-fault-timing.patch (text/x-patch, 1.4 KB)
v2-0002-Fix-race-condition-in-parallel-hash-join-batch-cl.patch (text/x-patch, 9.9 KB)
phj-barriers.png (image/png, 154.8 KB)