From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: intermittent failures in Cygwin from select_parallel tests
Date: 2017-06-15 18:34:50
Message-ID: 29733.1497551690@sss.pgh.pa.us
Lists: pgsql-hackers
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I think you're right. So here's a theory:
> 1. The ERROR mapping the DSM segment is just a case of the worker
> losing a race, and isn't a bug.
I concur that this is a possibility, but if we expect this to happen,
seems like there should be other occurrences in the buildfarm logs.
I trolled the last three months' worth of check/installcheck logs (all runs,
not just failures), and could find exactly two cases of "could not map
dynamic shared memory segment":
 sysname  |    branch     |      snapshot       |     stage      |                                                l
----------+---------------+---------------------+----------------+---------------------------------------------------------------------------------------------------
 lorikeet | REL9_6_STABLE | 2017-05-03 10:21:31 | Check          | 2017-05-03 06:27:32.626 EDT [5909b094.1e28:1] ERROR: could not map dynamic shared memory segment
 lorikeet | HEAD          | 2017-06-13 20:28:33 | InstallCheck-C | 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment
(2 rows)
Now maybe this can be explained away by saying that the worker never loses
the race unless it's subject to cygwin's unusually slow fork() emulation,
but somehow I doubt that. For one thing, it's not clear why that path
would be slower than EXEC_BACKEND, which would also involve populating
a new process image from scratch.
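To make the suspected race concrete, here is a minimal standalone sketch of
the failure shape.  It uses plain POSIX shared memory rather than our dsm.c
machinery, and a made-up segment name: the "leader" tears the named segment
down before the "worker" gets around to attaching by name, so the worker's
attach fails in much the same way as the "could not map dynamic shared memory
segment" error above.

/*
 * Illustration only -- not PostgreSQL source.  Parent creates a named
 * segment, then destroys it before the child attaches by name; the
 * child's attach fails, analogous to a parallel worker that loses the
 * race against the leader tearing down the DSM segment.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	const char *name = "/dsm_race_demo";	/* hypothetical segment name */

	/* "Leader" creates and sizes the segment. */
	int			fd = shm_open(name, O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, 4096) < 0)
	{
		perror("create segment");
		return 1;
	}

	pid_t		pid = fork();

	if (pid == 0)
	{
		/* "Worker": attaches by name, but only after losing the race. */
		sleep(1);
		int			wfd = shm_open(name, O_RDWR, 0);

		if (wfd < 0)
		{
			perror("worker: could not map shared memory segment");
			_exit(1);
		}
		_exit(0);
	}

	/* "Leader" gives up and destroys the segment before the worker attaches. */
	close(fd);
	shm_unlink(name);
	waitpid(pid, NULL, 0);
	return 0;
}

(In the real case the worker attaches via the DSM handle it is handed at
startup rather than by name, as I understand it, but the timing window is
analogous.)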
BTW, that 9.6 failure is worth studying because it looks quite a bit
different from the one on HEAD. It looks like the worker failed to
launch and then the leader got hung up waiting for the worker.
Eventually other stuff started failing because the select_parallel
test is holding an exclusive lock on tenk1 throughout its session.
(Does it really need to do that ALTER TABLE?)
> 2. But when that happens, parallel_terminate_count is getting bumped
> twice for some reason.
> 3. So then the leader process fails that assertion when it tries to
> launch the parallel workers for the next query.
It seems like this has to trace to some sort of logic error in the
postmaster that's allowing it to mess up parallel_terminate_count,
but I'm not managing to construct a plausible flow of control that
would cause that.
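For illustration only, here is a toy sketch of the bookkeeping pattern under
discussion, assuming two monotonically increasing counters and an assertion
that registrations never fall behind terminations.  The names are borrowed
for readability; this is not the actual bgworker.c code.  It just shows how
counting one termination twice trips such an assertion the next time we go
to launch workers.

/*
 * Illustration only -- not the postmaster's real bookkeeping.  One
 * termination counted twice makes terminate_count overtake
 * register_count, so the invariant check fails on the next launch.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t parallel_register_count = 0;
static uint32_t parallel_terminate_count = 0;

static void
register_worker(void)
{
	parallel_register_count++;
}

static void
terminate_worker(void)
{
	parallel_terminate_count++;
}

static void
launch_next_query_workers(void)
{
	/* Invariant checked before computing the number of active workers. */
	assert(parallel_register_count >= parallel_terminate_count);
	printf("active workers: %u\n",
		   (unsigned) (parallel_register_count - parallel_terminate_count));
}

int
main(void)
{
	register_worker();			/* one worker launched */
	terminate_worker();			/* ... and it exits */
	terminate_worker();			/* the same exit counted twice (the suspected bug) */
	launch_next_query_workers();	/* assertion failure here */
	return 0;
}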
regards, tom lane