Quick Links

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From:	Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, ashutosh(dot)bapat(at)2ndquadrant(dot)com
Subject:	Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date:	2020-03-17 08:14:39
Message-ID:	CAPmGK15L2RshRJA2y3zSi6EKKPe2wOKT56od2KRR8H3M4cqOFg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Ashutosh,

On Wed, Mar 4, 2020 at 1:48 AM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> I reviewed the patch. Except for the logic of matching the pairs of partitions from already merged partitions, I think the code changes are good. But there are several places where it needs further changes to comments. The attached patch has those. I have described some of them below.

Thanks for reviewing!

> + * We can not perform partition-wise join if either of the joining
> + * relations is not partitioned.
>
> We are consistently using partitionwise instead of partition-wise.

Will fix.

> + /*
> + * See if the partition bounds for inputs are exactly the same, in
> + * which case we don't need to work hard: partitions with the same
> + * partition indexes will form join pairs, and the join rel will have
> + * the same partition bounds as inputs; otherwise try to merge the
> + * partition bounds along with generating join pairs.
>
> Phrase "joining relations" is better than "inputs", IMO. Updated in the
> attached patch.

"inputs" is used in many places in the planner performing join
planning, so I'm not sure "joining relations" is better than "inputs".

> + /*
> + * If the partition bounds for the join rel are not merged ones,
> + * inputs are guaranteed to have the same partition bounds, so
> + * partitions with the same partition indexes will form join pairs;
> + * else let get_matching_part_pairs() do the work.
> + */
> + if (joinrel->merged)
> + {
>
> This condition in the comment is opposite to the condition being checked in
> code, so looks confusing. BTW this comment is also repeated around line 1405.
> See attached patch for correction.

OK, I'll revise the comments as proposed.

> + /*
> + * If this segment of the join is empty, it means that this segment
>
> "partition of the join" looks consistent with other usages than "segment of the
> join".

Actually, "segment" is used in the existing comments in the caller
function try_partitionwise_join(), so I think "segment" is better here
for consistency.

> + /*
> + * Get a relids set of partition(s) involved in this join segment that
> + * are from the rel1 side.
> + */
> + child_relids1 = bms_intersect(child_joinrel->relids,
> + rel1->all_partrels);
>
> The partition bounds are sorted by their values. Even for a list partitioned
> table, the bounds are sorted by the least partition value. We do not allow
> multiple paritions from one side to be joined with one partition on the other
> and vice versa. All this together means that the partitions of the join
> relation are formed by joining partitions from either side in the order of
> their bounds. This means that the matching pairs of partitions can be found by
> matching relids of partitions of join with those of the joining relation by
> traversing partitions from all the three relations only once in the order they
> appears in partition bounds of corresponding relations.

Consider this 2-way join for list partitioned tables:

CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003');
CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0002');
INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
generate_series(1, 100) i WHERE i % 10 in (1, 2, 3);
ANALYZE plt1_ad;
CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c);
CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0004');
CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0003');
INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
generate_series(1, 100) i WHERE i % 10 IN (2, 3, 4);
ANALYZE plt2_ad;

EXPLAIN (COSTS OFF)
SELECT t1.c, t1.a, t2.a FROM plt1_ad t1 INNER JOIN plt2_ad t2 ON (t1.c
= t2.c) WHERE t1.a < 10 ORDER BY t1.c, t1.a, t2.a;
QUERY PLAN
-----------------------------------------------------
Sort
Sort Key: t1.c, t1.a, t2.a
-> Append
-> Hash Join
Hash Cond: (t2_1.c = t1_2.c)
-> Seq Scan on plt2_ad_p1 t2_1
-> Hash
-> Seq Scan on plt1_ad_p2 t1_2
Filter: (a < 10)
-> Hash Join
Hash Cond: (t2_2.c = t1_1.c)
-> Seq Scan on plt2_ad_p2 t2_2
-> Hash
-> Seq Scan on plt1_ad_p1 t1_1
Filter: (a < 10)
(15 rows)

As you can see from the EXPLAIN result, the first partition on the
outer side matches the second partition on the inner side, and the
second partition on the outer side matches the first partition on the
inner side. How does the algorithm you proposed work e.g., when an
N-way join for list partitioned tables contains this join as its lower
join?

> If we use this
> algorithm, we don't need all_partrels to be collected and also don't need to
> search base or join relation. That, I think, will reduce the time and space
> complexity of this logic. Am I missing something?

Relids is used for storing all_partrels, so the time/space cost of
handling it would be small. Also, the cost of searing base relations
would be small. The cost of searching join relations would be a bit
large in some cases, but I thought that would be acceptable, compared
with large overhead of performing other part of partitionwise join
planning.

> + if (rel1_is_simple)
>
> This variable is used only in one place. So instead we should the expression
> assigning the value to it. Changed in the attached patch.

I don't think that's a good idea, because this check is done
repeatedly in a for loop.

> - rel->nparts = 0;
> + rel->nparts = -1;
>
> I think we need to add comments around various values that nparts can take. How
> about like something attached.

> + case PARTITION_STRATEGY_HASH:
> + merged_bounds = NULL;
>
> I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
> the reason is thus: When the modulo of used for partition mapping (i.e. maximum
> number of has partitions) is same, their partition bounds are same and do not
> need merging.

I don't think that's always true; there are cases where the moduli are
the same, but the partition bounds are not, because it's possible to
only define partitions for some of the remainders. See the discussion
in [1].

> If the maximum number of partitions is different for both the
> joining relations, there's high probability that one partition on one side will
> join with multiple partitions on the other side. So exact partition bounds
> match will work in most of the cases. The cases otherwise are not so common to
> spend the effort in coding and planning.
>
> I have added this explanation in the patch.

I also think it would be great if we can perform generic partitionwise
join for hash partitioned tables, so I'd like to propose to add
something like this, instead: "Currently we support partitionwise join
for hash partitioned tables only when the partition bounds for them
exactly match, but later it might be worth the effort to relax the
restriction."

> + if (part_index == -1)
> + return -1;
> + } while (is_dummy_partition(rel, part_index));
>
> I understand why we are skipping NULL positions. I am not sure why are we are
> skipping over RelOptInfos which exist but are marked as dummy; we can still
> create a join pair with those partitions.

Yeah, but I think it's safe to skip over those partitions as well,
because such a join pair can be created using
merge_partition_with_dummy().

> +/*
> + * get_merged_range_bounds
> + * Given the bounds of range partitions to be join, determine the range
>
> s/join/joined/

Good catch! Will fix.

> There are more changes to comments, where I thought that the comments are
> required or existing comments need more clarification. Please review the
> attached patch. This patch is created on top of
> v32-0001-Improve-partition-matching-for-partitionwise-join.

Thanks for the patch! I will review the patch ASAP.

Sorry for the delay.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAAJ_b94tJTix3kR8uBjin-ruJ-7ojn-gAWJQRicbLqAttQTe1g%40mail.gmail.com

In response to

Re: [HACKERS] advanced partition matching algorithm for partition-wise join at 2020-03-03 16:48:40 from Ashutosh Bapat

Responses

Re: [HACKERS] advanced partition matching algorithm for partition-wise join at 2020-03-23 13:41:54 from Ashutosh Bapat

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Masahiko Sawada	2020-03-17 08:37:51	Missing errcode() in ereport
Previous Message	Ivan N. Taranov	2020-03-17 07:38:29	Re: custom postgres launcher for tests