From: | Kohei KaiGai <kaigai(at)heterodb(dot)com> |
---|---|
To: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Asymmetric partition-wise JOIN |
Date: | 2019-08-22 16:05:19 |
Message-ID: | CAOP8fzY9p5==63EDdAb14p1=eUxtqvVHrCXtEVnpnyJ5Mzq96Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello,
Even though nobody has respond the thread, I tried to make a prototype of
the asymmetric partition-wise join support.
This feature tries to join non-partitioned and partitioned relation
before append.
See the example below:
create table ptable (dist int, a int, b int) partition by hash (dist);
create table ptable_p0 partition of ptable for values with (modulus 3,
remainder 0);
create table ptable_p1 partition of ptable for values with (modulus 3,
remainder 1);
create table ptable_p2 partition of ptable for values with (modulus 3,
remainder 2);
create table t1 (aid int, label text);
create table t2 (bid int, label text);
insert into ptable (select x, (1000*random())::int,
(1000*random())::int from generate_series(1,1000000) x);
insert into t1 (select x, md5(x::text) from generate_series(1,50) x);
insert into t2 (select x, md5(x::text) from generate_series(1,50) x);
vacuum analyze ptable;
vacuum analyze t1;
vacuum analyze t2;
ptable.a has values between 0 and 1000, and t1.aid has values between 1 and 50.
Therefore, tables join on ptable and t1 by a=aid can reduce almost 95% rows.
On the other hands, t1 is not partitioned and join-keys are not partition keys.
So, Append must process million rows first, then HashJoin processes
the rows read
from the partitioned table, and 95% of them are eventually dropped.
On the other words, 95% of jobs by Append are waste of time and CPU cycles.
postgres=# explain select * from ptable, t1 where a = aid;
QUERY PLAN
------------------------------------------------------------------------------
Hash Join (cost=2.12..24658.62 rows=49950 width=49)
Hash Cond: (ptable_p0.a = t1.aid)
-> Append (cost=0.00..20407.00 rows=1000000 width=12)
-> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12)
-> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12)
-> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
(8 rows)
The asymmetric partitionwise join allows to join non-partitioned tables and
partitioned tables prior to Append.
postgres=# set enable_partitionwise_join = on;
SET
postgres=# explain select * from ptable, t1 where a = aid;
QUERY PLAN
------------------------------------------------------------------------------
Append (cost=2.12..19912.62 rows=49950 width=49)
-> Hash Join (cost=2.12..6552.96 rows=16647 width=49)
Hash Cond: (ptable_p0.a = t1.aid)
-> Seq Scan on ptable_p0 (cost=0.00..5134.63 rows=333263 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
-> Hash Join (cost=2.12..6557.29 rows=16658 width=49)
Hash Cond: (ptable_p1.a = t1.aid)
-> Seq Scan on ptable_p1 (cost=0.00..5137.97 rows=333497 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
-> Hash Join (cost=2.12..6552.62 rows=16645 width=49)
Hash Cond: (ptable_p2.a = t1.aid)
-> Seq Scan on ptable_p2 (cost=0.00..5134.40 rows=333240 width=12)
-> Hash (cost=1.50..1.50 rows=50 width=37)
-> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37)
(16 rows)
We can consider the table join ptable X t1 above is equivalent to:
(ptable_p0 + ptable_p1 + ptable_p2) X t1
= (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
It returns an equivalent result, however, rows are already reduced by HashJoin
in the individual leaf of Append, so CPU-cycles consumed by Append node can
be cheaper.
On the other hands, it has a downside because t1 must be read 3 times and
hash table also must be built 3 times. It increases the expected cost,
so planner
may not choose the asymmetric partition-wise join plan.
One idea I have is, sibling HashJoin shares a hash table that was built once
by any of the sibling Hash plan. Right now, it is not implemented yet.
How about your thought for this feature?
Best regards,
2019年8月12日(月) 15:03 Kohei KaiGai <kaigai(at)heterodb(dot)com>:
>
> Hello,
>
> PostgreSQL optimizer right now considers join pairs on only
> non-partition - non-partition or
> partition-leaf - partition-leaf relations. On the other hands, it is
> harmless and makes sense to
> consider a join pair on non-partition - partition-leaf.
>
> See the example below. ptable is partitioned by hash, and contains 10M
> rows. ftable is not
> partitioned and contains 50 rows. Most of ptable::fkey shall not have
> matched rows in this
> join.
>
> create table ptable (fkey int, dist text) partition by hash (dist);
> create table ptable_p0 partition of ptable for values with (modulus 3,
> remainder 0);
> create table ptable_p1 partition of ptable for values with (modulus 3,
> remainder 1);
> create table ptable_p2 partition of ptable for values with (modulus 3,
> remainder 2);
> insert into ptable (select x % 10000, md5(x::text) from
> generate_series(1,10000000) x);
>
> create table ftable (pkey int primary key, memo text);
> insert into ftable (select x, 'ftable__#' || x::text from
> generate_series(1,50) x);
> vacuum analyze;
>
> postgres=# explain analyze select count(*) from ptable p, ftable f
> where p.fkey = f.pkey;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=266393.38..266393.39 rows=1 width=8) (actual
> time=2333.193..2333.194 rows=1 loops=1)
> -> Hash Join (cost=2.12..260143.38 rows=2500000 width=0) (actual
> time=0.056..2330.079 rows=50000 loops=1)
> Hash Cond: (p.fkey = f.pkey)
> -> Append (cost=0.00..233335.00 rows=10000000 width=4)
> (actual time=0.012..1617.268 rows=10000000 loops=1)
> -> Seq Scan on ptable_p0 p (cost=0.00..61101.96
> rows=3332796 width=4) (actual time=0.011..351.137 rows=3332796
> loops=1)
> -> Seq Scan on ptable_p1 p_1 (cost=0.00..61106.25
> rows=3333025 width=4) (actual time=0.005..272.925 rows=3333025
> loops=1)
> -> Seq Scan on ptable_p2 p_2 (cost=0.00..61126.79
> rows=3334179 width=4) (actual time=0.006..416.141 rows=3334179
> loops=1)
> -> Hash (cost=1.50..1.50 rows=50 width=4) (actual
> time=0.033..0.034 rows=50 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 10kB
> -> Seq Scan on ftable f (cost=0.00..1.50 rows=50
> width=4) (actual time=0.004..0.017 rows=50 loops=1)
> Planning Time: 0.286 ms
> Execution Time: 2333.264 ms
> (12 rows)
>
> We can manually rewrite this query as follows:
>
> postgres=# explain analyze select count(*) from (
> select * from ptable_p0 p, ftable f where p.fkey =
> f.pkey union all
> select * from ptable_p1 p, ftable f where p.fkey =
> f.pkey union all
> select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
>
> Because Append does not process tuples that shall have no matched
> tuples in ftable,
> this query has cheaper cost and short query execution time.
> (2333ms --> 1396ms)
>
> postgres=# explain analyze select count(*) from (
> select * from ptable_p0 p, ftable f where p.fkey =
> f.pkey union all
> select * from ptable_p1 p, ftable f where p.fkey =
> f.pkey union all
> select * from ptable_p2 p, ftable f where p.fkey = f.pkey) subqry;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=210478.25..210478.26 rows=1 width=8) (actual
> time=1396.024..1396.024 rows=1 loops=1)
> -> Append (cost=2.12..210353.14 rows=50042 width=0) (actual
> time=0.058..1393.008 rows=50000 loops=1)
> -> Subquery Scan on "*SELECT* 1" (cost=2.12..70023.66
> rows=16726 width=0) (actual time=0.057..573.197 rows=16789 loops=1)
> -> Hash Join (cost=2.12..69856.40 rows=16726
> width=72) (actual time=0.056..571.718 rows=16789 loops=1)
> Hash Cond: (p.fkey = f.pkey)
> -> Seq Scan on ptable_p0 p (cost=0.00..61101.96
> rows=3332796 width=4) (actual time=0.009..255.791 rows=3332796
> loops=1)
> -> Hash (cost=1.50..1.50 rows=50 width=4)
> (actual time=0.034..0.035 rows=50 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 10kB
> -> Seq Scan on ftable f (cost=0.00..1.50
> rows=50 width=4) (actual time=0.004..0.019 rows=50 loops=1)
> -> Subquery Scan on "*SELECT* 2" (cost=2.12..70027.43
> rows=16617 width=0) (actual time=0.036..409.712 rows=16578 loops=1)
> -> Hash Join (cost=2.12..69861.26 rows=16617
> width=72) (actual time=0.036..408.626 rows=16578 loops=1)
> Hash Cond: (p_1.fkey = f_1.pkey)
> -> Seq Scan on ptable_p1 p_1
> (cost=0.00..61106.25 rows=3333025 width=4) (actual time=0.005..181.422
> rows=3333025 loops=1)
> -> Hash (cost=1.50..1.50 rows=50 width=4)
> (actual time=0.020..0.020 rows=50 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 10kB
> -> Seq Scan on ftable f_1
> (cost=0.00..1.50 rows=50 width=4) (actual time=0.004..0.011 rows=50
> loops=1)
> -> Subquery Scan on "*SELECT* 3" (cost=2.12..70051.84
> rows=16699 width=0) (actual time=0.025..407.103 rows=16633 loops=1)
> -> Hash Join (cost=2.12..69884.85 rows=16699
> width=72) (actual time=0.025..406.048 rows=16633 loops=1)
> Hash Cond: (p_2.fkey = f_2.pkey)
> -> Seq Scan on ptable_p2 p_2
> (cost=0.00..61126.79 rows=3334179 width=4) (actual time=0.004..181.015
> rows=3334179 loops=1)
> -> Hash (cost=1.50..1.50 rows=50 width=4)
> (actual time=0.014..0.014 rows=50 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 10kB
> -> Seq Scan on ftable f_2
> (cost=0.00..1.50 rows=50 width=4) (actual time=0.003..0.008 rows=50
> loops=1)
> Planning Time: 0.614 ms
> Execution Time: 1396.131 ms
> (25 rows)
>
> How about your opinions for this kind of asymmetric partition-wise
> JOIN support by the optimizer?
> I think we can harmlessly push-down inner-join and left-join if
> partition-leaf is left side.
>
> Probably, we need to implement two key functionalities.
> 1. Construction of RelOpInfo for join on non-partition table and
> partition-leafs for each pairs.
> Instead of JoinPaths, this logic adds AppendPath that takes
> asymmetric partition-wise join
> paths as sub-paths. Other optimization logic is equivalent as we
> are currently doing.
> 2. Allow to share the hash-table built from table scan distributed to
> individual partition leafs.
> In the above example, SeqScan on ftable and relevant Hash path
> will make identical hash-
> table for the upcoming hash-join. If sibling paths have equivalent
> results, it is reasonable to
> reuse it.
>
> Best regards,
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kaigai(at)heterodb(dot)com>
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai(at)heterodb(dot)com>
Attachment | Content-Type | Size |
---|---|---|
pgsql13-asymmetric-partitionwise-join.v1.patch | application/octet-stream | 20.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Anastasia Lubennikova | 2019-08-22 16:13:05 | Re: standby recovery fails (tablespace related) (tentative patch and discussion) |
Previous Message | Pavel Stehule | 2019-08-22 15:56:46 | Re: Why overhead of SPI is so large? |