Re: Parallel Seq Scan

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Seq Scan
Date: 2015-04-28 12:07:34
Message-ID: CA+TgmoZJcsOxN+46ppEDeigN-E0f5WxUH0F7H-M6YLLF1A7KFg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 24, 2015 at 8:32 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> - InitializeParallelWorkers() still mixes together general parallel
>> executor concerns with concerns specific to parallel sequential scan
>> (e.g. EstimatePartialSeqScanSpace).
>
> Here we are doing two things: the first is for the planned statement,
> and the second is node-specific, which in this case is the parallel
> heap scan descriptor. So if I understand correctly, you want us to
> remove the second one and have a recursive function achieve the same.

Right.

>> - It's hard to believe this is right:
>>
>> + if (parallelstmt->inst_options)
>> + receiver = None_Receiver;
>>
>> Really? Flush the tuples if there are *any instrumentation options
>> whatsoever*? At the very least, that doesn't look too future-proof,
>> but I'm suspicious that it's outright incorrect.
>
> Instrumentation info is for the EXPLAIN statement, where we don't need
> tuples, and it is set the same way there as well; refer to
> ExplainOnePlan(). What makes you feel this is incorrect?

Well, for one thing, it's going to completely invalidate the result of
EXPLAIN. I mean, consider this:

Hash Join
-> Parallel Seq Scan
-> Hash
   -> Seq Scan

If you have the workers throw away the rows from the parallel seq scan
instead of sending them back to the master, the master won't join
those rows against the other table. And then the "actual" row counts,
timing, etc. will all be totally wrong. Worse, if the user is
EXPLAIN-ing a SELECT INTO command, the results will be totally wrong.

I don't think you can use ExplainOnePlan() as precedent for the theory
that explain_options != 0 means discard everything, because that
function does not do that. It bases the decision to throw away the
output on the fact that EXPLAIN was used, and throws it away unless an
IntoClause was also specified. It does this even if
instrument_options == 0. Meanwhile, auto_explain does NOT throw away
the output even if instrument_options != 0, nor should it! But even
if none of that were an issue, throwing away part of the results from
an internal plan tree is not the same thing as throwing away the final
result stream, and is dead wrong.
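
For reference, the receiver selection in ExplainOnePlan() boils down
to roughly this (paraphrased from explain.c, comments mine):

    /* paraphrase of ExplainOnePlan(); not a verbatim quote */
    if (into)
        dest = CreateIntoRelDestReceiver(into); /* CREATE TABLE AS: keep */
    else
        dest = None_Receiver;                   /* plain EXPLAIN: discard */

Note that the test is on the IntoClause, not on instrument_options.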

>> - I think ParallelStmt probably shouldn't be defined in parsenodes.h.
>> That file is included in a lot of places, and adding all of those
>> extra #includes there doesn't seem like a good idea for modularity
>> reasons even if you don't care about partial rebuilds. Something that
>> includes a shm_mq obviously isn't a "parse" node in any meaningful
>> sense anyway.
>
> How about tcop/tcopprot.h?

The comment at the top of that file says "prototypes for postgres.c".

Generally, unless there is some reason to do otherwise, the prototypes
for a .c file in src/backend go in a .h file with the same name in
src/include. I don't see why we should do differently here.
ParallelStmt should be defined and used in a file living in
src/backend/executor, and the header should have the same name and go
in src/include/executor.
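
To sketch what I have in mind -- the file name and fields here are
hypothetical, just to show the shape:

    /* src/include/executor/execParallel.h (hypothetical) */
    #ifndef EXECPARALLEL_H
    #define EXECPARALLEL_H

    #include "nodes/plannodes.h"
    #include "storage/shm_mq.h"

    typedef struct ParallelStmt
    {
        NodeTag        type;
        PlannedStmt   *plannedstmt;  /* plan fragment the worker runs */
        shm_mq_handle *responseq;    /* queue for returning tuples */
        int            inst_options; /* instrumentation flags */
    } ParallelStmt;

    extern void ExecParallelStmt(ParallelStmt *stmt);

    #endif   /* EXECPARALLEL_H */

That way nothing outside the executor needs to know about shm_mq.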

>> - I don't think you need both setup cost and startup cost. Starting
>> up more workers isn't particularly more expensive than starting up
>> fewer of them, because most of the overhead is in waiting for them to
>> actually start, and if the number of workers is reasonable, they'll
>> all be doing that in parallel with each other. I suggest removing
>> parallel_startup_cost and keeping parallel_setup_cost.
>
> There is some work (like creating shm queues and launching workers)
> that is done in proportion to the number of workers during setup time.
> I have kept two parameters to distinguish such work. I think you have
> a point that the startup of some or all workers could happen in
> parallel, but I feel that is still work proportional to the number of
> workers. Such a parameter could also be useful for future parallel
> operations where we need to set up IPC between workers, or other work
> that is proportional to the number of workers.

That's technically true, but the incremental work involved in
supporting a new worker is extremely small compared with worker startup
times. I'm guessing that the setup cost is going to be on the order of
hundreds of thousands or millions, and the startup cost is going to be
on the order of tens or ones. Unless you can present some contrary
evidence, I think we should rip it out.
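
To make the magnitudes concrete -- the constants here are assumptions
for illustration, not measurements:

    /* toy arithmetic with assumed magnitudes, not measurements */
    #include <stdio.h>

    int main(void)
    {
        double setup_cost = 100000.0; /* one-time launch-and-wait cost */
        double per_worker = 10.0;     /* incremental cost per worker */

        for (int nworkers = 1; nworkers <= 16; nworkers *= 2)
            printf("%2d workers: %.0f\n",
                   nworkers, setup_cost + nworkers * per_worker);
        /* the per-worker term never moves the total by even 0.2% */
        return 0;
    }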

And I actually hope you *can't* present some contrary evidence.
Because if you can, then that might mean that we need to cost every
possible path from 0 up to N workers and let the costing machinery
decide which one is better. If you can't, then we can cost the
non-parallel path and the maximally-parallel path and be done. And
that would be much better, because it will be faster. Remember, just
because we cost a bunch of parallel paths doesn't mean that any of
them will actually be chosen. We need to avoid generating too much
additional planner work in cases where we don't end up deciding on
parallelism anyway.

>> - In cost_funnel(), I don't think it's right to divide the run cost by
>> nWorkers + 1. Suppose we've got a plan that looks like this:
>>
>> Funnel
>> -> Hash Join
>>    -> Partial Seq Scan on a
>>    -> Hash
>>       -> Seq Scan on b
>>
>> The sequential scan on b is going to get executed once per worker,
>> whereas the effort for the sequential scan on a is going to be divided
>> over all the workers. So the right way to cost this is as follows:
>>
>> (a) The cost of the partial sequential scan on a is equal to the cost
>> of a regular sequential scan, plus a little bit of overhead to account
>> for communication via the ParallelHeapScanDesc, divided by the number
>> of workers + 1.
>> (b) The cost of the remaining nodes under the funnel works normally.
>> (c) The cost of the funnel is equal to the cost of the hash join plus
>> number of tuples multiplied by per-tuple communication overhead plus a
>> large fixed overhead reflecting the time it takes the workers to
>> start.
>>
>
> IIUC, the change for this would be to remove the run-cost change
> (dividing the run cost by nWorkers + 1) from cost_funnel and make a
> similar change, as suggested by point (a), in the cost calculation of
> the partial sequential scan.

Right.

> As of now, we don't do anything that can move a Funnel node on top of
> a hash join, so I am not sure if you are expecting any extra handling
> as part of points (b) or (c).

But we will want to do that in the future, so we should set up the
costing correctly now.
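
As a sketch of how (a) through (c) would combine -- every name and
constant below is an assumption for illustration, not actual code:

    /* toy model of the proposed funnel costing */
    #include <stdio.h>

    int main(void)
    {
        int    nworkers        = 4;
        double seq_scan_cost   = 10000.0;  /* plain seq scan on a */
        double scan_overhead   = 100.0;    /* ParallelHeapScanDesc cost */
        double join_cost       = 5000.0;   /* hash build + probe, per (b) */
        double ntuples         = 50000.0;  /* rows crossing the funnel */
        double tuple_comm_cost = 0.1;      /* per-tuple transfer cost */
        double setup_cost      = 100000.0; /* fixed worker-launch cost */

        /* (a) the scan's cost is shared by master and workers */
        double partial_scan =
            (seq_scan_cost + scan_overhead) / (nworkers + 1);

        /* (c) the funnel adds tuple transfer plus the fixed setup */
        double funnel = partial_scan + join_cost
                      + ntuples * tuple_comm_cost + setup_cost;

        printf("funnel total cost: %.1f\n", funnel);
        return 0;
    }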

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
