From: | Kohei KaiGai <kaigai(at)heterodb(dot)com> |
---|---|
To: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Asymmetric partition-wise JOIN |
Date: | 2019-08-12 06:03:14 |
Message-ID: | CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello,
PostgreSQL optimizer right now considers join pairs on only
non-partition - non-partition or
partition-leaf - partition-leaf relations. On the other hands, it is
harmless and makes sense to
consider a join pair on non-partition - partition-leaf.
See the example below. ptable is partitioned by hash, and contains 10M
rows. ftable is not
partitioned and contains 50 rows. Most of ptable::fkey shall not have
matched rows in this
join.
create table ptable (fkey int, dist text) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
insert into ptable (select x % 10000, md5(x::text) from
generate_series(1,10000000) x);
create table ftable (pkey int primary key, memo text);
insert into ftable (select x, 'ftable__#' || x::text from
generate_series(1,50) x);
vacuum analyze;
postgres=# explain analyze select count(*) from ptable p, ftable f
where p.fkey = f.pkey;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=266393.38..266393.39 rows=1 width=8) (actual
time=2333.193..2333.194 rows=1 loops=1)
-> Hash Join (cost=2.12..260143.38 rows=2500000 width=0) (actual
time=0.056..2330.079 rows=50000 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Append (cost=0.00..233335.00 rows=10000000 width=4)
(actual time=0.012..1617.268 rows=10000000 loops=1)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796
loops=1)
-> Seq Scan on ptable_p1 p_1 (cost=0.00..61106.25
rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025
loops=1)
-> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79
rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4) (actual
time=0.033..0.034 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50 rows=50
width=4) (actual time=0.004..0.017 rows=50 loops=1)
Planning Time: 0.286 ms
Execution Time: 2333.264 ms
(12 rows)
We can manually rewrite this query as follows:
postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
Because Append does not process tuples that shall have no matched
tuples in ftable,
this query has cheaper cost and short query execution time.
(2333ms --> 1396ms)
postgres=# explain analyze select count(*) from (
select * from ptable_p0 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p1 p, ftable f where p.fkey =
f.pkey union all
select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=210478.25..210478.26 rows=1 width=8) (actual
time=1396.024..1396.024 rows=1 loops=1)
-> Append (cost=2.12..210353.14 rows=50042 width=0) (actual
time=0.058..1393.008 rows=50000 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=2.12..70023.66
rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
-> Hash Join (cost=2.12..69856.40 rows=16726
width=72) (actual time=0.056..571.718 rows=16789 loops=1)
Hash Cond: (p.fkey = f.pkey)
-> Seq Scan on ptable_p0 p (cost=0.00..61101.96
rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796
loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.034..0.035 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f (cost=0.00..1.50
rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=2.12..70027.43
rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
-> Hash Join (cost=2.12..69861.26 rows=16617
width=72) (actual time=0.036..408.626 rows=16578 loops=1)
Hash Cond: (p_1.fkey = f_1.pkey)
-> Seq Scan on ptable_p1 p_1
(cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422
rows=3333025 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.020..0.020 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_1
(cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50
loops=1)
-> Subquery Scan on "*SELECT* 3" (cost=2.12..70051.84
rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
-> Hash Join (cost=2.12..69884.85 rows=16699
width=72) (actual time=0.025..406.048 rows=16633 loops=1)
Hash Cond: (p_2.fkey = f_2.pkey)
-> Seq Scan on ptable_p2 p_2
(cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015
rows=3334179 loops=1)
-> Hash (cost=1.50..1.50 rows=50 width=4)
(actual time=0.014..0.014 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on ftable f_2
(cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50
loops=1)
Planning Time: 0.614 ms
Execution Time: 1396.131 ms
(25 rows)
How about your opinions for this kind of asymmetric partition-wise
JOIN support by the optimizer?
I think we can harmlessly push-down inner-join and left-join if
partition-leaf is left side.
Probably, we need to implement two key functionalities.
1. Construction of RelOpInfo for join on non-partition table and
partition-leafs for each pairs.
Instead of JoinPaths, this logic adds AppendPath that takes
asymmetric partition-wise join
paths as sub-paths. Other optimization logic is equivalent as we
are currently doing.
2. Allow to share the hash-table built from table scan distributed to
individual partition leafs.
In the above example, SeqScan on ftable and relevant Hash path
will make identical hash-
table for the upcoming hash-join. If sibling paths have equivalent
results, it is reasonable to
reuse it.
Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai(at)heterodb(dot)com>
From | Date | Subject | |
---|---|---|---|
Next Message | Antonin Houska | 2019-08-12 08:56:36 | Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS) |
Previous Message | Kohei KaiGai | 2019-08-12 04:28:03 | Re: How to retain lesser paths at add_path()? |