Re: [DESIGN] ParallelAppend

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: [DESIGN] ParallelAppend
Date: 2015-11-19 07:59:21
Message-ID: CAA4eK1J4oNGmxvB4_jY6CfemN5rpHbuMd7FrpTk5Q2gzUBmKqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:
> > Don't we need the startup cost incase we need to build partial paths for
> > joinpaths like mergepath?
> > Also, I think there are other cases for single relation scan where
startup
> > cost can matter like when there are psuedoconstants in qualification
> > (refer cost_qual_eval_walker()) or let us say if someone has disabled
> > seq scan (disable_cost is considered as startup cost.)
>
> I'm not saying that we don't need to compute it. I'm saying we don't
> need to take it into consideration when deciding which paths have
> merit. Note that consider_statup is set this way:
>
> rel->consider_startup = (root->tuple_fraction > 0);
>

Even when consider_startup is false, still startup_cost is used for cost
calc, now may be ignoring that is okay for partial paths, but still it seems
worth thinking why leaving for partial paths it is okay even though it
is used in add_path().

+ * We don't generate parameterized partial paths because they seem
unlikely
+ * ever to be
worthwhile. The only way we could ever use such a path is
+ * by executing a nested loop with a complete
path on the outer side - thus,
+ * each worker would scan the entire outer relation - and the partial
path
+ * on the inner side - thus, each worker would scan only part of the inner
+ * relation. This is
silly: a parameterized path is generally going to be
+ * based on an index scan, and we can't generate a
partial path for that.

Won't it be useful to consider parameterized paths for below kind of
plans where we can push the jointree to worker and each worker can
scan the complete outer relation A and then the rest work is divided
among workers (ofcourse there can be other ways to parallelize such joins,
but still the way described also seems to be possible)?

NestLoop
-> Seq Scan on A
Hash Join
Join Condition: B.Y = C.W
-> Seq Scan on B
-> Index Scan using C_Z_IDX on C
Index Condition: C.Z = A.X

-
Is the main reason to have add_partial_path() is that it has some
less checks or is it that current add_path will give wrong answers
in any case?

If there is no case where add_path can't work, then there is some
advanatge in retaining add_path() atleast in terms of maintining
the code.

+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
{
..
+ /* Unless pathkeys are incompable, keep just one of the two paths. */
..

typo - 'incompable'

> > A.
> > This means that for inheritance child relations for which rel pages are
> > less than parallel_threshold, it will always consider the cost shared
> > between 1 worker and leader as per below calc in cost_seqscan:
> > if (path->parallel_degree > 0)
> > run_cost = run_cost / (path->parallel_degree + 0.5);
> >
> > I think this might not be the appropriate cost model for even for
> > non-inheritence relations which has pages more than parallel_threshold,
> > but it seems to be even worst for inheritance children which have
> > pages less than parallel_threshold
>
> Why?

Because I think the way code is written, it assumes that for each of the
inheritence-child relation which has pages lesser than threshold, half
the work will be done by master-backend which doesn't seem to be the
right distribution. Consider a case where there are three such children
each having cost 100 to scan, now it will cost them as
100/1.5 + 100/1.5 + 100/1.5 which means that per worker, it is
considering 0.5 of master backends work which seems to be wrong.

I think for Append case, we should consider this cost during Append path
creation in create_append_path(). Basically we can make cost_seqscan
to ignore the cost reduction due to parallel_degree for inheritance
relations
and then during Append path creation we can consider it and also consider
work unit of master backend as 0.5 with respect to overall work.

-
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
+plan as possible. Expanding the range of cases in which more work can be
+pushed below the Gather (and
costly them accurately) is likely to keep us
+busy for a long time to come.

Seems there is a typo in above text.
/costly/cost

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Rijkers 2015-11-19 08:41:42 Re: warning: HS_KEY redefined (9.5 beta2)
Previous Message Michael Paquier 2015-11-19 07:39:15 Re: [PROPOSAL] VACUUM Progress Checker.