From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] parallel.c oblivion of worker-startup failures |
Date: | 2018-01-24 04:25:30 |
Message-ID: | CAEepm=3j-LETxYHRDJN2JQ0H1NpE+ydv7uwHmcH8qDvV7d+89w@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> In the case of Gather, for example, we won't
>> WaitForParallelWorkersToFinish() until we've read all the tuples. If
>> there's a tuple queue that does not yet have a sender, then we'll just
>> wait for it to get one, even if the sender fell victim to a fork
>> failure and is never going to show up.
>>
>
> Hmm, I think that case will be addressed because tuple queues can
> detect if the leader is not attached. It does in code path
> shm_mq_receive->shm_mq_counterparty_gone. In
> shm_mq_counterparty_gone, it can detect if the worker is gone by using
> GetBackgroundWorkerPid. Moreover, I have manually tested this
> particular case before saying your patch is fine. Do you have some
> other case in mind which I am missing?
Hmm. Yeah. I can't seem to reach a stuck case and was probably just
confused and managed to confuse Robert too. If you make
fork_process() fail randomly (see attached), I see that there are a
couple of easily reachable failure modes (example session at bottom of
message):
1. HandleParallelMessages() is reached and raises a "lost connection
to parallel worker" error because shm_mq_receive() returns
SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked
GetBackgroundWorkerPid() just as you said. I guess that's happening
because some other process is (coincidentally) sending
PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a
process is unexpectedly stopped.
2. WaitForParallelWorkersToFinish() is reached and raises a "parallel
worker failed to initialize" error. TupleQueueReaderNext() set done
to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once
again, that is because shm_mq_counterparty_gone() returned true. This
is the bit Robert and I missed in our off-list discussion.
As long as we always get our latch set by the postmaster after a fork
failure (i.e. kill(SIGUSR1)), and GetBackgroundWorkerPid() is
guaranteed to return BGWH_STOPPED after that, and as long as we only
ever wait in latch/CFI loops and try to read from a
shm_mq, then I don't see a failure mode that hangs.
This doesn't help a parallel.c client that doesn't try to read from a
shm_mq though, like parallel CREATE INDEX (using the proposed design
without dynamic Barrier). We can't rely on a coincidental
PROCSIG_PARALLEL_MESSAGE like in (1), and it's not going to try to
read from a queue like in (2). So wouldn't it need a way to perform
its own similar check for unexpectedly BGWH_STOPPED processes whenever
its latch is set?
If there were some way for the postmaster to cause the
PROCSIG_PARALLEL_MESSAGE reason to be set in the leader process,
instead of just notifying it via kill(SIGUSR1), when it fails to fork
a parallel worker, we'd get (1) for free in any latch/CFI loop code.
But I understand that we can't do that by project edict.
===
postgres=# create table foox as select generate_series(1, 10000) as a;
SELECT 10000
postgres=# alter table foox set (parallel_workers = 4);
ALTER TABLE
postgres=# set max_parallel_workers_per_gather = 4;
SET
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: parallel worker failed to initialize
HINT: More details may be available in the server log.
postgres=# select count(*) from foox;
count
-------
10000
(1 row)
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
postgres=# select count(*) from foox;
ERROR: lost connection to parallel worker
--
Thomas Munro
http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
chaos-monkey-fork-process.patch | application/octet-stream | 775 bytes |