Re: strange parallel query behavior after OOM crashes

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Neha Khatri <nehakhatri5(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: strange parallel query behavior after OOM crashes
Date: 2017-03-31 04:42:11
Message-ID: CAGz5QCLAvbbhFfkdD_C3DBOn-Rbo40yrwJk-Pky3PVQoSx9Scg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 31, 2017 at 5:43 AM, Neha Khatri <nehakhatri5(at)gmail(dot)com> wrote:
>
> On Fri, Mar 31, 2017 at 8:29 AM, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
> wrote:
>>
>> On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
>> <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>> >
>> > 1. Put an Assert(0) in ParallelQueryMain(), start the server, and
>> > execute any parallel query.
>> > In LaunchParallelWorkers, you can see
>> > nworkers = n, nworkers_launched = n (n > 0)
>> > But all the workers will crash because of the assert statement.
>> > 2. The server restarts automatically and initializes
>> > BackgroundWorkerData->parallel_register_count and
>> > BackgroundWorkerData->parallel_terminate_count in shared memory.
>> > After that, it calls ForgetBackgroundWorker, which increments
>> > parallel_terminate_count. In LaunchParallelWorkers, we have the
>> > following condition:
>> > if ((BackgroundWorkerData->parallel_register_count -
>> > BackgroundWorkerData->parallel_terminate_count) >=
>> > max_parallel_workers)
>> > DO NOT launch any parallel worker.
>> > Hence, nworkers = n, nworkers_launched = 0.
>> parallel_register_count and parallel_terminate_count are both
>> unsigned integers. So whenever the difference would be negative, it
>> wraps around to a well-defined unsigned value that is certainly much
>> larger than max_parallel_workers. Hence, no workers will be launched.
>> I've attached a patch to fix this.
>
>
> The current explanation of the active number of parallel workers is:
>
> * The active
> * number of parallel workers is the number of registered workers minus the
> * terminated ones.
>
> In situations like the one you mentioned above, this formula can give a
> negative number of active parallel workers. However, a negative number of
> active parallel workers does not make any sense.
Agreed.

> I feel it would be better to explain in the code in which situations the
> formula can generate a negative result and what that means.
I think we need to find a fix so that it never generates a negative
result. The last patch I submitted handles a negative value correctly,
but surely that's not enough.
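
To make the wraparound described above concrete, here is a minimal
standalone sketch in C. It is not PostgreSQL code: the function names are
hypothetical, the counters are assumed to be 32-bit unsigned values, and
max_parallel_workers is passed as an unsigned value only to keep the
example simple. It just mirrors the quoted condition from
LaunchParallelWorkers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: "registered minus terminated", computed in
 * unsigned arithmetic. If terminate_count is larger than register_count,
 * the result does not go negative; it wraps around modulo 2^32.
 */
static uint32_t
active_parallel_workers(uint32_t register_count, uint32_t terminate_count)
{
    return (uint32_t) (register_count - terminate_count);
}

/*
 * Hypothetical helper mirroring the quoted launch condition: a worker
 * may be launched only while the active count is below the limit.
 */
static bool
can_launch_parallel_worker(uint32_t register_count,
                           uint32_t terminate_count,
                           uint32_t max_parallel_workers)
{
    uint32_t active = active_parallel_workers(register_count,
                                              terminate_count);

    return active < max_parallel_workers;
}
```

With register_count = 0 and terminate_count = 1, which is the state
described after the crash restart, active_parallel_workers() returns
4294967295, so can_launch_parallel_worker() is false for any realistic
max_parallel_workers and nworkers_launched stays at 0.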

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
