Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-04-06 01:07:11
Message-ID: CA+TgmobHYtw7P3nmNYSb3yNbrKsRs-wabvcVf1-Zr2Xdn2SE7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Only inner join conditions have equivalence classes associated with
> those. Outer join conditions create single element equivalence
> classes. So, we can not associate equivalence classes as they are with
> partition scheme. If we could do that, it makes life much easier since
> checking whether equi-join between all partition keys exist, is simply
> looking up equivalence classes that cover joining relations and find
> em_member corresponding to partition keys.

OK.

> It looks like we should only keep strategy, partnatts, partopfamily
> and parttypcoll in PartitionScheme. A partition-wise join between two
> relations would be possible if all those match.

Yes, I think so. Conceivably you could even exclude partnatts and
strategy, since there's nothing preventing a partitionwise join
between a list-partitioned table and a range-partitioned table, or
between a table range-partitioned on (a) and another range-partitioned
on (a, b), but there is probably not much benefit in trying to cover
such cases. I think it's reasonable to tell users that this is only
going to work when the partitioning strategy is the same and the join
conditions include all of the partitioning columns on both sides.

> There's a relevant comment in 0006, build_joinrel_partition_info()
> (probably that name needs to change, but I will do that once we have
> settled on design)
> + /*
> + * Construct partition keys for the join.
> + *
> + * An INNER join between two partitioned relations is partition by key
> + * expressions from both the relations. For tables A and B
> partitioned by a and b
> + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
> + * and B.b.
> + *
> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
> + * B.b NULL. These rows may not fit the partitioning conditions imposed on
> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
> + * Strictly speaking, partition keys of an OUTER join should include
> + * partition key expressions from the OUTER side only. Consider a join like
> + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
> + * include B.b as partition key expression for (AB), it prohibits us from
> + * using partition-wise join when joining (AB) with C as there is no
> + * equi-join between partition keys of joining relations. But two NULL
> + * values are never equal and no two rows from mis-matching partitions can
> + * join. Hence it's safe to include B.b as partition key expression for
> + * (AB), even though rows in (AB) are not strictly partitioned by B.b.
> + */
>
> I think that also needs to be reviewed carefully.

The following passage from src/backend/optimizer/README seems highly relevant:

===
The planner's treatment of outer join reordering is based on the following
identities:

1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B). If Pbc is not
strict, the first form might produce some rows with nonnull C columns
where the second form would make those entries null.
===

In other words, I think your statement that null is never equal to
null is a bit imprecise. Somebody could certainly create an operator
that is named "=" which returns true in that case, and then they could
say, hey, two nulls are equal (when you use that operator). The
argument needs to be made in terms of the formal properties of the
operator. The relevant logic is in have_partkey_equi_join:

+ /* Skip clauses which are not equality conditions. */
+ if (rinfo->hashjoinoperator == InvalidOid &&
!rinfo->mergeopfamilies)
+ continue;

Actually, I think the hashjoinoperator test is formally and
practically unnecessary here; lower down there is a test to see
whether the partitioning scheme's operator family is a member of
rinfo->mergeopfamilies, which will certainly fail if we got through
this test with rinfo->mergeopfamilies == NIL just on the strength of
rinfo->hashjoinoperator != InvalidOid. So you can just bail out if
rinfo->mergeopfamilies == NIL. But the underlying point here is that
the only thing you really know about the function is that it's got to
be a strategy-3 operator in some btree opclass; if that guarantees
strictness, then so be it -- but I wasn't able to find anything in the
code or documentation off-hand that supports that contention, so we
might need to think a bit more about why (or if) this is guaranteed to
be true.

> Partition-wise joins
> may be happy including partition keys from all sides, but
> partition-wise aggregates may not be, esp. when pushing complete
> aggregation down to partitions. In that case, rows with NULL partition
> key, which falls on nullable side of join, will be spread across
> multiple partitions. Proabably, we should separate nullable and
> non-nullable partition key expressions.

I don't think I understand quite what you're getting at here. Can you
spell this out in more detail? To push an aggregate down to
partitions, you need the grouping key to match the applicable
partition key, and the partition key shouldn't allow nulls in more
than one place. Now I think your point may be that outer join
semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
good test case for partitionwise aggregate. I'd be inclined to think
that we should just give up on partitionwise aggregate in such cases;
it's not worth trying to optimize such a weird query, at least IMHO.
(Does this sort of case ever happen with joins? I think not, as long
as the join operator is strict.)

I spent some time thinking about this patch set today and I don't see
that there's much point in committing any more of this to v10. I
think that 0001 and 0002 are probably committable or very close at
this point. However, 0001 is adding more complexity than I think is
warranted until we're actually ready to commit the feature that uses
it, and 0002 is so small that committing isn't really going to smooth
future development much. 0003-0009 are essentially all one big patch
that will have to be committed together.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-04-06 01:07:59 Re: Re: new set of psql patches for loading (saving) data from (to) text, binary files
Previous Message David Rowley 2017-04-06 01:05:24 Re: [COMMITTERS] pgsql: Collect and use multi-column dependency stats