From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
Subject: | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
Date: | 2018-01-24 06:53:11 |
Message-ID: | CAEepm=2aGWaQz6kMfC1ZeWJt1NxhP0f2j9y=o_Lq260a7sVEHg@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jan 24, 2018 at 6:43 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>>>> I am going to repeat my previous suggestion that we use a Barrier here.
>>>>> Given the discussion subsequent to my original proposal, this can be a
>>>>> lot simpler than what I suggested originally. Each worker does
>>>>> BarrierAttach() before beginning to read tuples (exiting if the phase
>>>>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>>>>> sorting. The leader does BarrierAttach() before launching workers and
>>>>> BarrierArriveAndWait() when it's done sorting.
>>>
>>> How does the leader detect if one of the workers does BarrierAttach and
>>> then fails (either exits or errors out) before doing
>>> BarrierArriveAndDetach?
>>
>> If you attach and then exit cleanly, that's a programming error and
>> would cause anyone who runs BarrierArriveAndWait() to hang forever.
>>
>
> Right, but what if the worker dies due to proc_exit(1) or
> something like that before calling BarrierArriveAndWait? I think this
> is part of the problem we have solved in
> WaitForParallelWorkersToFinish such that if the worker exits abruptly
> at any point due to some reason, the system should not hang.

Actually, what I said before is no longer true. Since commit 2badb5af,
if a worker exits unexpectedly, the new ParallelWorkerShutdown() exit
hook sends PROCSIG_PARALLEL_MESSAGE to the leader (apparently after
detaching from the error queue), and the leader aborts when it tries to
read the error queue. To test that, I hacked Parallel Hash like this:

     BarrierAttach(build_barrier);
+    if (ParallelWorkerNumber == 0)
+    {
+        pg_usleep(1000000);
+        proc_exit(1);
+    }
Now I see:
postgres=# select count(*) from foox r join foox s on r.a = s.a;
ERROR: lost connection to parallel worker
Using a debugger I can see the leader raising that error with this stack:
  HandleParallelMessages at parallel.c:890
  ProcessInterrupts at postgres.c:3053
  ConditionVariableSleep(cv=0x000000010a62e4c8, wait_event_info=134217737) at condition_variable.c:151
  BarrierArriveAndWait(barrier=0x000000010a62e4b0, wait_event_info=134217737) at barrier.c:191
  MultiExecParallelHash(node=0x00007ffcd9050b10) at nodeHash.c:312
  MultiExecHash(node=0x00007ffcd9050b10) at nodeHash.c:112
  MultiExecProcNode(node=0x00007ffcd9050b10) at execProcnode.c:502
  ExecParallelHashJoin [inlined]
  ExecHashJoinImpl(pstate=0x00007ffcda01baa0, parallel='\x01') at nodeHashjoin.c:291
  ExecParallelHashJoin(pstate=0x00007ffcda01baa0) at nodeHashjoin.c:582
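
For anyone who wants to poke at the shape of this without a PostgreSQL
build, here is a rough standalone model. It is a toy and not PostgreSQL
code: ToyBarrier, toy_attach(), toy_arrive_and_wait(),
toy_arrive_and_detach(), worker_lost and simulate() are all made-up
names, pthreads stand in for backend processes, and a flag plus a
condition-variable broadcast stands in for PROCSIG_PARALLEL_MESSAGE and
the error queue. It only tries to show that a waiter which can be
interrupted does not hang when an attached participant vanishes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Toy stand-in for a barrier; NOT PostgreSQL's barrier.c. */
typedef struct ToyBarrier
{
    pthread_mutex_t mutex;
    pthread_cond_t  cv;
    int         participants;   /* currently attached */
    int         ever_attached;  /* monotonic; only sequences the demo */
    int         arrived;        /* arrivals in the current phase */
    int         phase;
    bool        worker_lost;    /* models the leader being interrupted */
} ToyBarrier;

typedef struct WorkerArg { ToyBarrier *barrier; bool dies; } WorkerArg;

/* Join the barrier and return the phase observed at attach time. */
static int
toy_attach(ToyBarrier *b)
{
    int         phase;

    pthread_mutex_lock(&b->mutex);
    b->participants++;
    b->ever_attached++;
    phase = b->phase;
    pthread_cond_broadcast(&b->cv);
    pthread_mutex_unlock(&b->mutex);
    return phase;
}

/* Leave the barrier; release waiters if we were the last one missing. */
static void
toy_arrive_and_detach(ToyBarrier *b)
{
    pthread_mutex_lock(&b->mutex);
    b->participants--;
    if (b->participants > 0 && b->arrived == b->participants)
    {
        b->arrived = 0;
        b->phase++;
        pthread_cond_broadcast(&b->cv);
    }
    pthread_mutex_unlock(&b->mutex);
}

/* Returns false if the wait was cut short because a worker was lost. */
static bool
toy_arrive_and_wait(ToyBarrier *b)
{
    bool        advanced = true;
    int         start;

    pthread_mutex_lock(&b->mutex);
    start = b->phase;
    if (++b->arrived == b->participants)
    {
        b->arrived = 0;
        b->phase++;
        pthread_cond_broadcast(&b->cv);
    }
    else
    {
        while (b->phase == start && !b->worker_lost)
            pthread_cond_wait(&b->cv, &b->mutex);
        advanced = (b->phase != start);
    }
    pthread_mutex_unlock(&b->mutex);
    return advanced;
}

static void *
worker_main(void *p)
{
    WorkerArg  *arg = p;
    ToyBarrier *b = arg->barrier;
    int         phase = toy_attach(b);

    assert(phase == 0);         /* simulate() waits for us before arriving */
    (void) phase;
    if (arg->dies)
    {
        /* Abrupt exit: never detach, but the "exit hook" wakes the leader. */
        pthread_mutex_lock(&b->mutex);
        b->worker_lost = true;
        pthread_cond_broadcast(&b->cv);
        pthread_mutex_unlock(&b->mutex);
        return NULL;
    }
    toy_arrive_and_detach(b);
    return NULL;
}

/* Leader side: true if the barrier was passed, false if a worker was lost. */
bool
simulate(bool worker_dies)
{
    ToyBarrier  b;
    WorkerArg   arg = {&b, worker_dies};
    pthread_t   tid;
    bool        ok;

    pthread_mutex_init(&b.mutex, NULL);
    pthread_cond_init(&b.cv, NULL);
    b.participants = b.ever_attached = b.arrived = b.phase = 0;
    b.worker_lost = false;

    toy_attach(&b);             /* leader attaches before launching */
    pthread_create(&tid, NULL, worker_main, &arg);

    /* Demo-only step: wait until the worker has attached. */
    pthread_mutex_lock(&b.mutex);
    while (b.ever_attached < 2)
        pthread_cond_wait(&b.cv, &b.mutex);
    pthread_mutex_unlock(&b.mutex);

    ok = toy_arrive_and_wait(&b);
    pthread_join(tid, NULL);
    return ok;
}
```

Under those assumptions, simulate(false) returns true (the worker
detaches and the leader gets past the barrier) and simulate(true)
returns false (the leader's wait is cut short, which is roughly the
point where the real leader raises the ERROR shown above).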
--
Thomas Munro
http://www.enterprisedb.com