Re: Parallel query execution introduces performance regressions

From: Jinho Jung <jinhojun(at)usc(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel query execution introduces performance regressions
Date: 2019-04-01 19:08:59
Message-ID: CAF6pyKf61cJVLr0uO_vhjDAWUTAEL6UBNs8WeT+5=ketxKOBXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Sorry for the duplicated thread here and thanks for the reply.

I have another question regarding parallel execution. We noticed that
PostgreSQL sometimes launches only one worker. Is that also an intended
behavior?

Best regards,
Jinho Jung

On Mon, Apr 1, 2019 at 3:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2019-04-01 11:52:54 -0700, Peter Geoghegan wrote:
> > On Mon, Apr 1, 2019 at 11:30 AM Jinho Jung <jinhojun(at)usc(dot)edu> wrote:
> > > Surprisingly, we found that even on a larger TPC-C database (scale
> factor of 50, roughly 4GB of size), parallel scan is still slower than the
> non-parallel execution plan in the old version.
> >
> > That's not a large database, and it's certainly not a large TPC-C
> > database. If you attempt to stay under the spec's maximum
> > tpmC/throughput per warehouse, which is 12.86 tpmC per warehouse, then
> > you'll need several thousand warehouses on modern hardware. We're
> > talking several hundred gigabytes. Otherwise, as far as the spec is
> > concerned you're testing an unrealistic workload. There will be
> > individual customers that make many more purchases than is humanly
> > possible. You're modelling an app involving hypothetical warehouse
> > employees that must enter data into their terminals at a rate that is
> > not humanly possible.
>
> I don't think that's really the problem here. It's that there's a
> fundamental misestimation in the query:
>
> > [OLD version]
> > Nested Loop Semi Join (cost=0.00..90020417940.08 rows=30005835 width=8)
> > (actual time=0.034..24981.895 rows=90017507 loops=1)
> > Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
> > -> Seq Scan on order_line ref_0 (cost=0.00..2011503.04 rows=90017504
> > width=12) (actual time=0.022..7145.811 rows=90017507 loops=1)
> > -> Materialize (cost=0.00..2771.00 rows=100000 width=4) (actual
> > time=0.000..0.000 rows=1 loops=90017507)
> > -> Seq Scan on item ref_1 (cost=0.00..2271.00 rows=100000
> width=4)
> > (actual time=0.006..0.006 rows=1 loops=1)
>
> note the estimated rows=100000 vs the actual rows=1 in the seqscan /
> materialize. That's what makes the planner think this is much more
> expensive than it is, which in turn triggers the use of a parallel scan.
>
> Greetings,
>
> Andres Freund
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2019-04-01 19:13:04 Re: Parallel query execution introduces performance regressions
Previous Message Peter Geoghegan 2019-04-01 19:05:57 Re: Parallel query execution introduces performance regressions