From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | <> join selectivity estimate question |
Date: | 2017-03-17 05:54:46 |
Message-ID: | CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi hackers,
While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can also reach a good and a bad plan in
unpatched master. One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.
Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB. That is a query with a large anti-join and a large semi-join.
8 workers = 8.3s
7 workers = 8.2s
6 workers = 8.5s
5 workers = 8.9s
4 workers = 9.5s
3 workers = 39.7s
2 workers = 36.9s
1 worker = 38.2s
0 workers = 47.9s
Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that row estimate for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).
The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded. Compare these two queries:
SELECT *
FROM lineitem l1
WHERE EXISTS (SELECT *
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey);
-> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)
SELECT *
FROM lineitem l1
WHERE EXISTS (SELECT *
FROM lineitem l2
WHERE l1.l_orderkey = l2.l_orderkey
AND l1.l_suppkey <> l2.l_suppkey);
-> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)
Or for a standalone example:
CREATE TABLE foo AS
SELECT (generate_series(1, 1000000) / 4)::int AS a,
(generate_series(1, 1000000) % 100)::int AS b;
ANALYZE foo;
SELECT *
FROM foo f1
WHERE EXISTS (SELECT *
FROM foo f2
WHERE f1.a = f2.a);
-> estimates 1,000,000 rows
SELECT *
FROM foo f1
WHERE EXISTS (SELECT *
FROM foo f2
WHERE f1.a = f2.a
AND f1.b <> f2.b);
-> estimates 1 row
I'm trying to wrap my brain around the selectivity code, but am too
green to grok how this part of the planner that I haven't previously
focused on works so far, and I'd like to understand whether this is
expected behaviour so that I can figure out how to tackle the reported
regression with my patch. What is happening here?
Thanks for reading.
--
Thomas Munro
http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
master_q21_3workers.txt | text/plain | 5.4 KB |
master_q21_4workers.txt | text/plain | 5.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Beena Emerson | 2017-03-17 06:08:56 | Re: increasing the default WAL segment size |
Previous Message | Kyotaro HORIGUCHI | 2017-03-17 05:23:13 | Re: Protect syscache from bloating with negative cache entries |