From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: strange parallel query behavior after OOM crashes |
Date: | 2017-04-04 17:52:29 |
Message-ID: | 73b84052-5dfe-0c40-4212-cfa96a1c6027@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 04/04/2017 06:52 PM, Robert Haas wrote:
> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>> <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>>>> 2. the server restarts automatically, initialize
>>>> BackgroundWorkerData->parallel_register_count and
>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>> parallel_terminate_count.
>>>
>>> Hmm. So this seems like the root of the problem. Presumably those
>>> things need to be reset AFTER forgetting any background workers from
>>> before the crash.
>>>
>> IMHO, the fix would be not to increase the terminated parallel worker
>> count whenever ForgetBackgroundWorker is called due to a bgworker
>> crash. I've attached a patch for the same. PFA.
>
> While I'm not opposed to that approach, I don't think this is a good
> way to implement it. If you want to pass an explicit flag to
> ForgetBackgroundWorker telling it whether or not it should performing
> the increment, fine. But with what you've got here, you're
> essentially relying on "spooky action at a distance". It would be
> easy for future code changes to break this, not realizing that
> somebody's got a hard dependency on 0 having a specific meaning.
>
I'm probably missing something, but I don't quite understand how these
values actually survive the crash. I mean, what I observed is OOM
followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply
reset the values back to 0? Or do we call ForgetBackgroundWorker() after
the crash for some reason?
In any case, the comment right before BackgroundWorkerArray says this:
* These counters can of course overflow, but it's not important here
* since the subtraction will still give the right number.
which means that this assert
+ Assert(BackgroundWorkerData->parallel_register_count >=
+ BackgroundWorkerData->parallel_terminate_count);
is outright broken, just like any other attempts to rely on simple
comparisons of these two counters, no?
> BTW, if this isn't on the open items list, it should be. It's
> presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.
>
Unrelated, but perhaps we should also add this to open items:
https://www.postgresql.org/message-id/flat/57bbc52c-5d40-bb80-2992-7e1027d0b67c%402ndquadrant.com
(although that's more a 9.6 material).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2017-04-04 17:54:19 | Re: Compiler warning in costsize.c |
Previous Message | Pavan Deolasee | 2017-04-04 17:48:44 | Re: Patch: Write Amplification Reduction Method (WARM) |