Quick Links

Re: TPC-H Q20 from 1 hour to 19 hours!

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc:	PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: TPC-H Q20 from 1 hour to 19 hours!
Date:	2017-03-29 22:14:30
Message-ID:	060a9d76-67ab-0152-ff2e-36bf9156fd78@2ndquadrant.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 03/29/2017 09:00 PM, Robert Haas wrote:
> On Mon, Mar 6, 2017 at 1:22 AM, Rafia Sabih
> <rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>> This is to bring to notice a peculiar instance I found recently while
>> running TPC-H benchmark queries. Q20 of the benchmark took 19 hours to
>> complete ...
>
> That's bad.
>
>> It is clear that selectivity estimations are really bad in this case
>> particularly at node,
>> -> Merge Join (cost=52959586.72..60024468.82 rows=85 width=16)
>> (actual time=1525322.753..2419045.641 rows=1696742 loops=1)
>> Merge Cond: ((lineitem.l_partkey =
>> partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>> Join Filter:
>> ((partsupp.ps_availqty)::numeric > ((0.5 * sum(lineitem.l_quantity))))
>> Rows Removed by Join Filter: 3771
>
> So, the selectivity estimation here is bad both before and after Tom's
> commit, but it's significantly worse after (actual value 1696742, old
> estimate 3771, new estimate 85).
>
>> Still this puzzled me as during earlier runs of this benchmark I never
>> encountered such prolonged running times. On further investigation I
>> found that on reverting the commit
>> 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218
>> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> Date: Sat Dec 17 15:28:54 2016 -0500
>> Fix FK-based join selectivity estimation for semi/antijoins.
>
> I don't think the problem originates at the Merge Join, though,
> because the commit says that at is fixing semi and anti-join estimates
> - this is a plain inner join, so in theory it shouldn't change.
> However, it's a bit hard for me to piece through these plans, the
> formatting kind of got messed up - things are wrapped. Could you
> possibly attach the plans as attachments?
>

I've been looking into this today, and it seems to me the simplest query
triggering this issue (essentially a part of q20) is this:

select
ps_suppkey
from
partsupp,
(
select
l_partkey agg_partkey,
l_suppkey agg_suppkey
from
lineitem
group by
l_partkey,
l_suppkey
) agg_lineitem
where
agg_partkey = ps_partkey
and agg_suppkey = ps_suppkey
and ps_partkey in (
select
p_partkey
from
part
where
p_name like 'hot%'
);

which does actually include a semijoin. What seems to trigger the issue
is the join to the aggregated lineitem table - when replacing it with a
plain table, everything seems to be estimated perfectly fine.

Attached is a simple SQL script, that runs three variants of the query:

(1) with the join to the aggregated lineitem table
(2) with a join to a plain lineitem table
(3) with a join to a plain lineitem table and to 'part' (without the
semijoin)

First the queries are executed on tables without any foreign keys
(between those three), then with a FK between lineitem and partsupp, and
finally with additional FK between partsupp and part.

Initially the estimates are bad, but once the first foreign key is
added, the estimates get very accurate - except for the case (1).

I've only ran the queries on 10GB data set, but that should be enough.
The plans are from current master - I'll rerun the script on an older
release later today.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment	Content-Type	Size
q20.sql	application/sql	3.9 KB
q20.log	text/x-log	27.9 KB

In response to

Re: TPC-H Q20 from 1 hour to 19 hours! at 2017-03-29 19:00:01 from Robert Haas

Responses

Re: TPC-H Q20 from 1 hour to 19 hours! at 2017-03-30 00:00:45 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Petr Jelinek	2017-03-29 23:01:38	Re: logical replication access control patches
Previous Message	Jesper Pedersen	2017-03-29 21:06:13	Re: Page Scan Mode in Hash Index