From: | Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Partition-wise join for join between (declaratively) partitioned tables |
Date: | 2017-04-10 07:11:25 |
Message-ID: | CAFjFpRcXN+3qZxHXzEsiApS6kLytf6C6VxkzpzPekXDM2Lq-5A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Only inner join conditions have equivalence classes associated with
>> those. Outer join conditions create single element equivalence
>> classes. So, we can not associate equivalence classes as they are with
>> partition scheme. If we could do that, it makes life much easier since
>> checking whether equi-join between all partition keys exist, is simply
>> looking up equivalence classes that cover joining relations and find
>> em_member corresponding to partition keys.
>
> OK.
>
>> It looks like we should only keep strategy, partnatts, partopfamily
>> and parttypcoll in PartitionScheme. A partition-wise join between two
>> relations would be possible if all those match.
>
> Yes, I think so. Conceivably you could even exclude partnatts and
> strategy, since there's nothing preventing a partitionwise join
> between a list-partitioned table and a range-partitioned table, or
> between a table range-partitioned on (a) and another range-partitioned
> on (a, b), but there is probably not much benefit in trying to cover
> such cases. I think it's reasonable to tell users that this is only
> going to work when the partitioning strategy is the same and the join
> conditions include all of the partitioning columns on both sides.
>
>> There's a relevant comment in 0006, build_joinrel_partition_info()
>> (probably that name needs to change, but I will do that once we have
>> settled on design)
>> + /*
>> + * Construct partition keys for the join.
>> + *
>> + * An INNER join between two partitioned relations is partition by key
>> + * expressions from both the relations. For tables A and B
>> partitioned by a and b
>> + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
>> + * and B.b.
>> + *
>> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
>> + * B.b NULL. These rows may not fit the partitioning conditions imposed on
>> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>> + * Strictly speaking, partition keys of an OUTER join should include
>> + * partition key expressions from the OUTER side only. Consider a join like
>> + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
>> + * include B.b as partition key expression for (AB), it prohibits us from
>> + * using partition-wise join when joining (AB) with C as there is no
>> + * equi-join between partition keys of joining relations. But two NULL
>> + * values are never equal and no two rows from mis-matching partitions can
>> + * join. Hence it's safe to include B.b as partition key expression for
>> + * (AB), even though rows in (AB) are not strictly partitioned by B.b.
>> + */
>>
>> I think that also needs to be reviewed carefully.
>
> The following passage from src/backend/optimizer/README seems highly relevant:
>
> ===
> The planner's treatment of outer join reordering is based on the following
> identities:
>
> 1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
> = (A innerjoin C on (Pac)) leftjoin B on (Pab)
>
> where Pac is a predicate referencing A and C, etc (in this case, clearly
> Pac cannot reference B, or the transformation is nonsensical).
>
> 2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
> = (A leftjoin C on (Pac)) leftjoin B on (Pab)
>
> 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
> = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
>
> Identity 3 only holds if predicate Pbc must fail for all-null B rows
> (that is, Pbc is strict for at least one column of B). If Pbc is not
> strict, the first form might produce some rows with nonnull C columns
> where the second form would make those entries null.
> ===
>
> In other words, I think your statement that null is never equal to
> null is a bit imprecise. Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator). The
> argument needs to be made in terms of the formal properties of the
> operator. The relevant logic is in have_partkey_equi_join:
>
> + /* Skip clauses which are not equality conditions. */
> + if (rinfo->hashjoinoperator == InvalidOid &&
> !rinfo->mergeopfamilies)
> + continue;
>
> Actually, I think the hashjoinoperator test is formally and
> practically unnecessary here; lower down there is a test to see
> whether the partitioning scheme's operator family is a member of
> rinfo->mergeopfamilies, which will certainly fail if we got through
> this test with rinfo->mergeopfamilies == NIL just on the strength of
> rinfo->hashjoinoperator != InvalidOid. So you can just bail out if
> rinfo->mergeopfamilies == NIL. But the underlying point here is that
> the only thing you really know about the function is that it's got to
> be a strategy-3 operator in some btree opclass; if that guarantees
> strictness, then so be it -- but I wasn't able to find anything in the
> code or documentation off-hand that supports that contention, so we
> might need to think a bit more about why (or if) this is guaranteed to
> be true.
I need more time to think about this. Will get back to this soon.
>
>> Partition-wise joins
>> may be happy including partition keys from all sides, but
>> partition-wise aggregates may not be, esp. when pushing complete
>> aggregation down to partitions. In that case, rows with NULL partition
>> key, which falls on nullable side of join, will be spread across
>> multiple partitions. Proabably, we should separate nullable and
>> non-nullable partition key expressions.
>
> I don't think I understand quite what you're getting at here. Can you
> spell this out in more detail? To push an aggregate down to
> partitions, you need the grouping key to match the applicable
> partition key, and the partition key shouldn't allow nulls in more
> than one place. Now I think your point may be that outer join
> semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
> FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
> good test case for partitionwise aggregate. I'd be inclined to think
> that we should just give up on partitionwise aggregate in such cases;
> it's not worth trying to optimize such a weird query, at least IMHO.
> (Does this sort of case ever happen with joins? I think not, as long
> as the join operator is strict.)
Yes, this is the case, I am thinking about. No, it doesn't happen with join.
>
> I spent some time thinking about this patch set today and I don't see
> that there's much point in committing any more of this to v10. I
> think that 0001 and 0002 are probably committable or very close at
> this point. However, 0001 is adding more complexity than I think is
> warranted until we're actually ready to commit the feature that uses
> it, and 0002 is so small that committing isn't really going to smooth
> future development much. 0003-0009 are essentially all one big patch
> that will have to be committed together.
Ok. Thanks.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From | Date | Subject | |
---|---|---|---|
Next Message | David Rowley | 2017-04-10 07:13:45 | Re: Compiler warning in costsize.c |
Previous Message | Craig Ringer | 2017-04-10 05:57:38 | Re: SCRAM authentication, take three |