From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: intermittent failures in Cygwin from select_parallel tests |
Date: | 2017-06-06 19:07:48 |
Message-ID: | CA+TgmoYGQViFsVPeMQM+9KvDAiPCEY1SmuH4=UrbfVjUswQ9ig@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jun 6, 2017 at 2:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> One thought is that the only places where shm_mq_set_sender() should
>> be getting invoked during the main regression tests are
>> ParallelWorkerMain() and ExecParallelGetReceiver, and both of those
>> places using ParallelWorkerNumber to figure out what address to pass.
>> So if ParallelWorkerNumber were getting set to the same value in two
>> different parallel workers - e.g. because the postmaster went nuts and
>> launched two processes instead of only one - or if
>> ParallelWorkerNumber were not getting initialized at all or were
>> getting initialized to some completely bogus value, it could cause
>> this symptom.
>
> Hmm. With some generous assumptions it'd be possible to think that
> aa1351f1eec4adae39be59ce9a21410f9dd42118 triggered this. That commit was
> present in 20 successful lorikeet runs before the first of these failures,
> which is a bit more than the MTBF after that, but not a huge amount more.
>
> That commit in itself looks innocent enough, but could it have exposed
> some latent bug in bgworker launching?

Hmm, that's a really interesting idea, but I can't quite put together
a plausible theory around it. I mean, it seems like that commit could
make launching bgworkers faster, which could conceivably tickle some
heretofore-latent timing-related bug. But it wouldn't, IIUC, make the
first worker start any faster than before - it would just make the
subsequent workers more closely spaced, and it's not very obvious how
that would cause a problem.

Another idea is that the commit in question is managing to corrupt
BackgroundWorkerList somehow. maybe_start_bgworkers() is using
slist_foreach_modify(), but previously it always returned after
calling do_start_bgworker(), and now it doesn't. So if
do_start_bgworker() did something that could modify the list
structure, then perhaps maybe_start_bgworkers() would get confused. I
don't really think that this theory has any legs, though.
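
To make the list-modification hazard concrete, here is a minimal sketch.
It is not PostgreSQL's ilist.h code: Node, List, list_unlink() and
walk_and_remove() are hypothetical names invented for illustration. It
assumes an iterator that, like a "modify-safe" foreach, caches the next
pointer before running the loop body. That makes removing the current
node safe, but if the body removes some other node (here, the
successor), the cached pointer refers to an entry that is no longer on
the list. Nodes are only unlinked, never freed, so the sketch can count
such stale visits without undefined behavior.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a singly linked list entry. */
typedef struct Node
{
    struct Node *next;
    int          id;
    bool         linked;    /* true while the node is on the list */
} Node;

typedef struct List
{
    Node       *head;
} List;

/* Build the list nodes[0] -> nodes[1] -> ... with ids 1..n. */
static void
build_list(List *list, Node nodes[], int n)
{
    list->head = NULL;
    for (int i = n - 1; i >= 0; i--)
    {
        nodes[i].id = i + 1;
        nodes[i].linked = true;
        nodes[i].next = list->head;
        list->head = &nodes[i];
    }
}

/* Unlink victim from the list; its own next pointer is left untouched. */
static void
list_unlink(List *list, Node *victim)
{
    for (Node **link = &list->head; *link != NULL; link = &(*link)->next)
    {
        if (*link == victim)
        {
            *link = victim->next;
            victim->linked = false;
            return;
        }
    }
}

/*
 * Walk the list, caching the next pointer before the loop body runs.
 * When the node with trigger_id is reached, remove either that node or
 * its successor.  Returns how many already-unlinked nodes were visited.
 */
static int
walk_and_remove(List *list, int trigger_id, bool remove_successor)
{
    int         stale_visits = 0;
    Node       *cur = list->head;

    while (cur != NULL)
    {
        Node       *next = cur->next;   /* cached before the body runs */

        if (!cur->linked)
            stale_visits++;             /* iterator followed a stale pointer */
        else if (cur->id == trigger_id)
        {
            Node       *victim = remove_successor ? cur->next : cur;

            if (victim != NULL)
                list_unlink(list, victim);
        }
        cur = next;
    }
    return stale_visits;
}
```

With a three-entry list and trigger_id 2, removing the current node
leaves the walk clean, while removing the successor makes the walk
visit one entry that is no longer on the list. Whether
do_start_bgworker() can actually cause anything like this is exactly
the open question above; the sketch only shows the general mechanism.
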
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company