Re: Parallel INSERT (INTO ... SELECT ...)

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Greg Nancarrow <gregn4422(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Date: 2020-10-09 12:04:38
Message-ID: CAA4eK1Ko8KFOXhK6CkgKXxNVA0uzJEpTbEcWOH3scfpu5GR=cw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 9, 2020 at 4:28 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
>
> On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
> > >
> > > Speaking of costing, I'm not sure I really agree with the current
> > > costing of a Gather node. Just considering a simple Parallel
> > > SeqScan case, the "run_cost += parallel_tuple_cost * path->path.rows;"
> > > part of the Gather cost always completely drowns out the other
> > > path costs when a large number of rows is involved (at least with
> > > the default parallel-related GUC values), such that a Parallel
> > > SeqScan is never the cheapest path. This linear relationship
> > > between the cost and the number of rows, scaled by
> > > parallel_tuple_cost, doesn't make sense to me. Surely after a
> > > certain number of rows, the overhead of launching workers is
> > > outweighed by the benefit of their parallel work, such that the
> > > more rows there are, the more likely a Parallel SeqScan is to
> > > benefit.
> > >
> >
> > That is true for the number of rows/pages we need to scan, not for
> > the number of tuples we need to return as a result. The formula
> > here considers the number of rows the parallel scan will return:
> > the more rows each parallel worker needs to pass via shared memory
> > to the Gather node, the more costly it will be.
> >
> > We do consider the total pages we need to scan in
> > compute_parallel_worker(), where we use a logarithmic formula to
> > determine the number of workers.
> >
>
> Despite the best intentions, the more rows there are in the table,
> the more the current costings seem geared towards choosing a
> non-parallel plan over a parallel plan. Yet the more rows there are
> in the table, the better a parallel plan appears to perform relative
> to a non-parallel plan.
> This doesn't seem right to me. Is there a rationale behind this
> costing model?
>
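
For reference, here is a rough, standalone paraphrase of the two
pieces of costing logic under discussion. It is simplified from
cost_gather() in costsize.c and compute_parallel_worker() in
allpaths.c, not the exact source; the constants are the documented
GUC defaults, and the table size and subpath cost in main() are
purely illustrative:

#include <limits.h>
#include <stdio.h>

/* Documented defaults for the relevant costing GUCs. */
static const double parallel_setup_cost = 1000.0;
static const double parallel_tuple_cost = 0.1;

/*
 * Gather charges its subpath's cost, plus a one-off setup fee, plus
 * a linear per-tuple fee for every row pushed through shared memory:
 * the "run_cost += parallel_tuple_cost * path->path.rows" term quoted
 * above.  (Simplified: the startup/run cost split is omitted here.)
 */
static double
gather_total_cost(double subpath_total_cost, double rows_returned)
{
    return subpath_total_cost +
        parallel_setup_cost +
        parallel_tuple_cost * rows_returned;
}

/*
 * The worker count grows only logarithmically with table size: one
 * worker at min_parallel_table_scan_size (default 8MB, i.e. 1024 8kB
 * pages), and one more each time the table triples in size.  The
 * real code also clamps the result to max_parallel_workers_per_gather
 * (default 2), which this sketch omits.
 */
static int
parallel_workers_for_pages(long heap_pages)
{
    long    threshold = 1024;   /* min_parallel_table_scan_size */
    int     workers = 1;

    if (heap_pages < threshold)
        return 0;
    while (heap_pages >= threshold * 3)
    {
        workers++;
        if (threshold > LONG_MAX / 3)
            break;
        threshold *= 3;
    }
    return workers;
}

int
main(void)
{
    /*
     * Illustrative only: a ~10M-row, 50000-page table whose parallel
     * scan is assumed to cost 150000 before the Gather is added.
     */
    printf("workers chosen for 50000 pages: %d\n",
           parallel_workers_for_pages(50000));
    printf("Gather cost, all 10M rows returned: %.0f\n",
           gather_total_cost(150000.0, 10000000.0));
    printf("Gather cost, only 100 rows returned: %.0f\n",
           gather_total_cost(150000.0, 100.0));
    return 0;
}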

Yes. AFAIK, there is no proof that we gain much by dividing the I/O
among workers; it is primarily the CPU effort that gives the benefit.
So parallel plans show the greatest benefit when we have to scan a
large table and then project far fewer rows.
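
To put rough numbers on it: with the default parallel_tuple_cost of
0.1, pushing all ten million rows in the sketch above through the
Gather adds 1,000,000 to the run cost (a total of 1,151,000), which
swamps whatever the workers save on the CPU side, whereas returning
only 100 rows adds just 10 (a total of 151,010). Note that
cost_seqscan() divides only the CPU run cost among the workers; the
page-read cost is not divided. That is why the model favors plans
that scan a lot but return a little.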

--
With Regards,
Amit Kapila.
