Re: Parallel Append implementation

From: "Tels" <nospam-pg-abuse(at)bloodgate(dot)com>
To: "Amit Khandekar" <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: "Ashutosh Bapat" <ashutosh(dot)bapat(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Append implementation
Date: 2017-03-10 11:01:10
Message-ID: 68a7cc5f4d7c3b68f82bac5a95d97a3f.squirrel@sm.webmail.pair.com
Lists: pgsql-hackers

Moin,

On Fri, March 10, 2017 3:24 am, Amit Khandekar wrote:
> On 10 March 2017 at 12:33, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> On Fri, Mar 10, 2017 at 11:33 AM, Ashutosh Bapat
>> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>>>
>>>> But as far as the code is concerned, I think the two-list approach will
>>>> turn out to be less simple if we derive two corresponding arrays in the
>>>> AppendState node. Handling two different arrays during execution does
>>>> not look clean. Whereas, the bitmapset that I have used in Append has
>>>> turned out to be very simple. I just had to do the check below (and
>>>> that is the only location) to see whether it's a partial or non-partial
>>>> subplan. There is no special handling anywhere else for a non-partial
>>>> subpath.
>>>>
>>>> /*
>>>>  * Increment worker count for the chosen node, if at all we found one.
>>>>  * For non-partial plans, set it to -1 instead, so that no other
>>>>  * workers run it.
>>>>  */
>>>> if (min_whichplan != PA_INVALID_PLAN)
>>>> {
>>>>     if (bms_is_member(min_whichplan,
>>>>                       ((Append *) state->ps.plan)->partial_subplans_set))
>>>>         padesc->pa_info[min_whichplan].pa_num_workers++;
>>>>     else
>>>>         padesc->pa_info[min_whichplan].pa_num_workers = -1;
>>>> }
>>>>
>>>> Now, since the Bitmapset field is used during execution with such
>>>> simplicity, why not have this same data structure in AppendPath, and
>>>> re-use the bitmapset field in the Append plan node without making a
>>>> copy of it? Otherwise, if we have two lists in AppendPath and a bitmap
>>>> in Append, there is again going to be code for a data structure
>>>> conversion.
>>>>
>>>
>>> I think there is some merit in separating out the non-parallel and
>>> parallel plans, either within the same array or outside it. The current
>>> logic to assign a plan to a worker looks at all the plans, unnecessarily
>>> hopping over the non-parallel ones after they have been given to a
>>> worker. If we separate the two, we can keep assigning new workers to the
>>> non-parallel plans first and then iterate over the parallel ones when a
>>> worker needs a plan to execute. We might eliminate the need for the
>>> special value -1 for the number of workers. You may separate the two
>>> kinds into two different arrays, or keep them within the same array and
>>> remember the smallest index of a parallel plan.
>
> Do you think we might get a performance benefit with this? I am looking
> more towards simplicity of the logic. Non-parallel plans would most
> likely be there only in the case of UNION ALL queries, not partitioned
> tables, and UNION ALL queries probably have far fewer subplans, so there
> won't be too many unnecessary hops. The need for num_workers = -1 will
> still be there for partial plans, because we need to set it to -1 once a
> worker finishes a plan.
>
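Just so I'm picturing the single-array variant correctly, here is a minimal
sketch of what I think is being proposed (all names below are mine, purely
for illustration, and not from the actual patch): the non-partial subplans
come first in the array, and a remembered index marks where the partial
ones begin.

/* Illustrative sketch only -- names are invented, not from the patch. */
typedef struct ParallelAppendState
{
    int  nplans;              /* total number of subplans */
    int  first_partial_plan;  /* subplans[0 .. first_partial_plan-1] are
                               * non-partial, the rest are partial */
    int  next_plan;           /* shared cursor into the partial plans
                               * (used further below) */
    int *pa_num_workers;      /* workers currently on each subplan;
                               * -1 marks a finished partial plan */
} ParallelAppendState;

/*
 * Hand a subplan to a worker: unclaimed non-partial plans first, then
 * fall back to the least-loaded partial plan.  Locking of the shared
 * fields is omitted for brevity.
 */
static int
choose_next_subplan(ParallelAppendState *pstate)
{
    int i;
    int best = -1;

    /* Non-partial plans are each run by exactly one worker. */
    for (i = 0; i < pstate->first_partial_plan; i++)
    {
        if (pstate->pa_num_workers[i] == 0)
        {
            pstate->pa_num_workers[i] = 1;
            return i;
        }
    }

    /* All non-partial plans taken: pick the partial plan with the
     * fewest workers, skipping finished ones (marked -1). */
    for (i = pstate->first_partial_plan; i < pstate->nplans; i++)
    {
        if (pstate->pa_num_workers[i] >= 0 &&
            (best < 0 ||
             pstate->pa_num_workers[i] < pstate->pa_num_workers[best]))
            best = i;
    }
    if (best >= 0)
        pstate->pa_num_workers[best]++;

    return best;                /* -1 means nothing left to run */
}

With a layout like that, the special value -1 would only be needed to mark
a finished partial plan, not to claim a non-partial one.
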
>>
>> Further to that, with this scheme and the scheme to distribute workers
>> equally irrespective of the maximum workers per plan, you don't need to
>> "scan" the subplans to find the one with the minimum number of workers.
>> If you treat the array of parallel plans as a circular queue, the plan
>> to be assigned to the next worker will always be the one after the plan
>> that was assigned most recently.
>
>> Once you have assigned workers to the non-parallel plans, initialize a
>> shared variable next_plan to point to the first parallel plan. When a
>> worker comes asking for a plan, assign the plan pointed to by next_plan
>> and update it to the next plan in the circular queue.
>
> At some point this logic may stop working. Imagine the plans are running
> with worker counts (1, 1, 1). The next worker goes to plan 1, so they run
> with (2, 1, 1), and next_plan now points to plan 2. Now suppose the
> worker on plan 2 finishes. It should not take plan 2 again, even though
> next_plan points to it; it should take plan 3, or whichever plan is not
> finished. Maybe a worker that finishes a plan should do this check before
> going directly to next_plan. But if this turns out to be as simple as
> finding the minimum-worker plan, we can use this logic; we will have to
> check. We can consider this even when we have a single list, anyway.
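
If I read the circular-queue idea plus the "skip finished plans" caveat
correctly, the worker-side logic might look roughly like the following
(again purely illustrative, reusing the made-up ParallelAppendState from my
sketch above with its next_plan field as the shared cursor; a finished plan
is marked with -1 and skipped, and locking is again omitted):

/*
 * Illustrative sketch of the shared next_plan cursor over the partial
 * plans, advancing round-robin and skipping finished plans.
 */
static int
choose_next_partial_plan(ParallelAppendState *pstate)
{
    int npartial = pstate->nplans - pstate->first_partial_plan;
    int plan = pstate->next_plan;
    int tries;

    for (tries = 0; tries < npartial; tries++)
    {
        if (pstate->pa_num_workers[plan] >= 0)
        {
            /* Take this plan and advance the shared cursor past it. */
            pstate->pa_num_workers[plan]++;
            pstate->next_plan = pstate->first_partial_plan +
                (plan - pstate->first_partial_plan + 1) % npartial;
            return plan;
        }

        /* Plan already finished: move on, wrapping within the partial plans. */
        plan = pstate->first_partial_plan +
            (plan - pstate->first_partial_plan + 1) % npartial;
    }

    return -1;                  /* all partial plans are finished */
}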

Just a question for me, to understand the implementation details vs. the
strategy:

Have you considered how the scheduling decision might impact performance
due to "inter-plan parallelism vs. intra-plan parallelism"?

So what would the scheduling strategy be? Should it be fixed, or
user-influenceable? And what would good strategies look like?

A simple example:

E.g. if we have 5 subplans, each of which can have at most 5 workers, and
we have 5 workers overall.

So, do we:

Assign 5 workers to plan 1. Let it finish.
Then assign 5 workers to plan 2. Let it finish.
and so on

or:

Assign 1 worker to each plan until no workers are left?

In the second case you would have 5 plans each running with a single
worker, i.e. in a quasi-sequential manner, which might be slower than the
first way. Or not; that probably needs some benchmarks.
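
Just to make the two extremes concrete, here is a toy sketch of how the
initial assignment could differ (a standalone illustration of the two
strategies, nothing to do with the actual patch code):

#include <stdio.h>

#define NPLANS        5
#define NWORKERS      5
#define MAX_PER_PLAN  5

int
main(void)
{
    int fill_first[NPLANS] = {0};   /* strategy 1: fill one plan, then the next */
    int spread_out[NPLANS] = {0};   /* strategy 2: one worker per plan */
    int w, i, least;

    for (w = 0; w < NWORKERS; w++)
    {
        /* Strategy 1: give the worker to the first plan that has room. */
        for (i = 0; i < NPLANS; i++)
        {
            if (fill_first[i] < MAX_PER_PLAN)
            {
                fill_first[i]++;
                break;
            }
        }

        /* Strategy 2: give the worker to the plan with the fewest workers. */
        least = 0;
        for (i = 1; i < NPLANS; i++)
        {
            if (spread_out[i] < spread_out[least])
                least = i;
        }
        spread_out[least]++;
    }

    /* Ends up as (5,0,0,0,0) vs. (1,1,1,1,1). */
    for (i = 0; i < NPLANS; i++)
        printf("plan %d: strategy 1 = %d, strategy 2 = %d\n",
               i + 1, fill_first[i], spread_out[i]);
    return 0;
}

Whether strategy 1 (each plan fully parallel, plans run one after another)
beats strategy 2 (all plans at once, each with a single worker) probably
depends on the plans themselves, so benchmarks would indeed be needed.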

Likewise, if you have a mix of plans with max workers like:

Plan A: 1 worker
Plan B: 2 workers
Plan C: 3 workers
Plan D: 1 worker
Plan E: 4 workers

Would the strategy be:

* Serve them in first-come-first-served order? (A, B, C, D, E?) (Would the
order here be random, depending on how the plans emerge, i.e. could the
user re-order the query to get a different order?)
* Serve them in max-workers order? (A, D, B, C, E)
* Serve all the 1-worker plans first, then fill the rest?
(A, D, B, C, E | A, D, C, B, E)
* Serve them by some other metric, e.g. index-only scans first, seq scans
last? Or a mix of all these?

Excuse me if I just didn't see this in the thread so far. :)

Best regards,

Tels
