Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-07-31 11:59:31
Message-ID: CAFjFpRf-9QKAi7jFRNQ2oYK3HqOWkPkTTh+mB9kM9TUQTvsvdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Here's revised patch set with only 0004 revised. That patch deals with
>> creating multi-level inheritance hierarchy from multi-level partition
>> hierarchy. The original logic of recursively calling
>> inheritance_planner()'s guts over the inheritance hierarchy required
>> that for every such recursion we flatten many lists created by that
>> code. Recursion also meant that root->append_rel_list is traversed as
>> many times as the number of partitioned partitions in the hierarchy.
>> Instead the revised version keep the iterative shape of
>> inheritance_planner() intact, thus naturally creating flat lists,
>> iterates over root->append_rel_list only once and is still easy to
>> read and maintain.
>
> 0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004:
>
> + if (brel->reloptkind != RELOPT_BASEREL &&
> + brte->relkind != RELKIND_PARTITIONED_TABLE)
>
> I spent a lot of time staring at this code before I figured out what
> was going on here. We're iterating over simple_rel_array, so the
> reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
> But does that guarantee that rtekind is RTE_RELATION such that
> brte->relkind will be initialized to a value? I'm not 100% sure.

Comment in RangeTblEntry says
952 /*
953 * Fields valid for a plain relation RTE (else zero):
954 *
... clipped portion for RTE_NAMEDTUPLESTORE related comment

960 Oid relid; /* OID of the relation */
961 char relkind; /* relation kind (see pg_class.relkind) */

This means that relkind will be 0 when rtekind != RTE_RELATION. So,
the condition holds. But code creating an RTE somewhere which is not
in sync with this comment would create a problem. So your suggestion
makes sense.

> I
> think it would be clearer to write this test like this:
>
> Assert(IS_SIMPLE_REL(brel));
> if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
> (brte->rtekind != RELOPT_BASEREL ||

Do you mean (brte_>rtekind != RTE_RELATION)?

> brte->relkind != RELKIND_PARTITIONED_TABLE))
> continue;
>
> Note that the way you wrote the comment is says if it *is* another
> REL, not if it's *not* a baserel; it's good if those kinds of little
> details match between the code and the comments.

I find the existing comment and code in this part of the function
differ. The comment just above the loop on simple_rel_array[], talks
about changing something in the child, but the very next line skips
child relations and later a loop on append_rel_list makes changes to
appropriate children. I guess, it's done that way to keep the code
working even after we introduce some RELOPTKIND other than BASEREL or
OTHER_MEMBER_REL for a simple rel. But your suggestion makes more
sense. Changed it according to your suggestion.

>
> It is not clear to me what the motivation is for the API changes in
> expanded_inherited_rtentry. They don't appear to be necessary.

expand_inherited_rtentry() creates AppendRelInfos for all the children
of a given parent and collects them in a list. The list is appended to
root->append_rel_list at the end of the function. Now that function
needs to do this recursively. This means that for a partitioned
partition table its children's AppendRelInfos will be added to
root->append_rel_list before AppendRelInfo of that partitioned
partition table. inheritance_planner() assumes that the parent's
AppendRelInfo comes before its children in append_rel_list.This
assumption allows it to be simplified greately, retaining its
iterative form. My earlier patches had recursive version of
inheritance_planner(), which is complex. I have comments in this patch
explaining this.

Adding AppendRelInfos to root->append_rel_list as and when they are
created would keep parent AppendRelInfos before those of children. But
that function throws away the AppendRelInfo it created when their are
no real children i.e. in partitioned table's case when has no leaf
partitions. So, we can't do that. Hence, I chose to change the API to
return the list of AppendRelInfos when the given RTE has real
children.

> If
> they are necessary, you need to do a more thorough job updating the
> comments. This one, in particular:
>
> * If so, add entries for all the child tables to the query's
> * rangetable, and build AppendRelInfo nodes for all the child tables
> * and add them to root->append_rel_list. If not, clear the entry's

Done.

>
> And the comments could maybe say something like "We return the list of
> appinfos rather than directly appending it to append_rel_list because
> $REASON."

Done. Please check the attached version.

>
> - * is a partitioned table.
> + * RTE simply duplicates the parent *partitioned* table.
> */
> - if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
> + if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)
>
> This is another case where it's hard to understand the test from the comments.

The current comment says it all, but it very cryptic manner.
1526 /*
1527 * Build an AppendRelInfo for this parent and child,
unless the child
1528 * RTE simply duplicates the parent *partitioned* table.
1529 */

The comment makes sense in the context of this paragraph in the prologue
1364 * Note that the original RTE is considered to represent the whole
1365 * inheritance set. The first of the generated RTEs is an RTE for the same
1366 * table, but with inh = false, to represent the parent table in its role
1367 * as a simple member of the inheritance set.
1368 *

The code avoids creating AppendRelInfos for a child which represents
the parent in its role as a simple member of inheritance set.

I have reworded it as
1526 /*
1527 * Build an AppendRelInfo for this parent and child,
unless the child
1528 * RTE represents the parent as a simple member of inheritance set.
1529 */

>
> + * In case of multi-level inheritance hierarchy, for every child we require
> + * PlannerInfo of its immediate parent. Hence we save those in a an array
>
> Say why. Also, need to fix "a an".

Done.

>
> I'm a little bit surprised that this patch doesn't make any changes to
> allpaths.c or relnode.c.

> It looks to me like we'll generate paths for
> the new RTEs that are being added. Are we going to end up with
> multiple levels of Append nodes, then? Does the consider the way
> consider_parallel is propagated up and down in set_append_rel_size()
> and set_append_rel_pathlist() really work with multiple levels? Maybe
> this is all fine; I haven't tried to verify it in depth.

This has been discussed before, but I can not locate the mail
answering these questions. accumulate_append_subpath() called from
add_paths_to_append_rel() takes care of flattening Merge/Append paths.
The planner code deals with the multi-level inheritance hierarchy
created for subqueries with set operations. There is code in relnode.c
to build the RelOptInfos for such subqueries recursively through using
RangeTblEntry::inh flag. So there are no changes in allpaths.c and
relnode.c. Are you looking for some other changes?

>
> Overall I think this is a reasonable direction to go but I'm worried
> that there may be bugs lurking -- other code that needs adjusting that
> hasn't been found, really.
>

Planner code is already aware of such hierarchies except DMLs, which
this patch adjusts. We have fixed issues revealed by mine and
Rajkumar's testing.
What kinds of things you suspect?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Beena Emerson 2017-07-31 12:28:51 Re: Default Partition for Range
Previous Message Heikki Linnakangas 2017-07-31 11:54:22 Re: PostgreSQL not setting OpenSSL session id context?