Re: Partitionwise join fails under GEQO

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Partitionwise join fails under GEQO
Date: 2020-10-05 12:52:10
Message-ID: CAExHW5tgiLsYC_CLcaKHFFc8H56C0s9mCu_0OpahGxn=hUi_Pg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 5, 2020 at 6:47 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> If you run the core regression tests with geqo_threshold = 2
> (injected, say, via ALTER SYSTEM SET), you will notice the
> earlier tests showing some join ordering differences, which
> is unsurprising ... but when it gets to partition_join, that
> test just dumps core.
>
> Investigation shows that the crash is due to a corrupt EquivalenceClass
> member. It gets that way because add_child_join_rel_equivalences
> is unaware of the planner's memory allocation policies, and feels
> free to mess with the EC data structures during join planning.
> When GEQO's transient memory context is thrown away after one round
> of GEQO planning, some still-referenced EC data goes with it,
> resulting in a crash during the next round.
>
> I band-aided around that with the attached patch, which is enough
> to prevent the crash, but it's entirely unsatisfactory as a production
> solution. The problem is that each GEQO round will re-add the same
> child EC members, since add_child_join_rel_equivalences has absolutely
> no concept that the members it wants might be there already. Since
> GEQO tends to use a lot of rounds, this can run to significant memory
> bloat, not to mention a pretty serious speed hit while EC operations
> thumb through all of those duplicate expressions.
>
> Now that I've seen this, I wonder whether add_child_join_rel_equivalences
> might not be making duplicate EC entries even without GEQO. Is there any
> guarantee that it's not called repeatedly on related join-rel sets?

build_child_join_rel() is the only caller of
add_child_join_rel_equivalences(). Before the first calls the later,
the first has an Assert
899 Assert(!find_join_rel(root, joinrel->relids));
This means that for a given child join rel this function is called
only once implying that add_child_join_rel_equivalences() is called
only once for a given child join rel. Initially I thought that this is
enough to not add duplicate child members. But to me use of
bms_overlap() in that function looks doubtful. Thinking allowed, let's
say there's an EC member with em_relids = {1, 2} 1, 2 being relids of
two partitioned tables. Let's assume that there's a third partitioned
table with relid = 3. When a child join rel between children, say 6
and 7, of 1 and 2 resp was produces the EC memeber with em_relids =
{1,2} was translated to produce and EC memeber with em_relids = {6, 7}
and em_is_child = true. But when we will add a child join rel between
(6, 7, 8) of a join between (1, 2, 3), bms_overlap({1, 2}, {1, 2, 3})
will return true and we will add one more member with em_relids = {6,
7}. So your suspicion is true. I think, using
bms_is_subset(top_parent_relids, em->em_relids) should fix that
problem. In addition, adding an Assert to make sure that we are not
adding multiple instances of same child EC member should suffice.
Thumbing through all the child EC members to eliminate duplicates
would be much costlier when a huge number of partitions are involved.

On a related but different subject, I have been thinking on-off about
the need for expression translation while planning partitions.The
translated expressions differ only in the leaf-vars and even for most
of the partitioned tables really only the varno (assuming that
partitioned table and partitions will have same layouts when a large
number of partitions are created automatically.) If we can introduce a
new type of (polymorphic) Var node which represents not just the
parent Var but also child Vars as well. The whole need for expression
translation vanishes, reducing memory requirements by many folds. Such
a Var node will indicate which varnos it covers and for a group of
varnos which varattno it represents. In most of the cases where parent
and the child share the same varattno, all the varnos form a single
set. Whole Var references pose a problem since parent's whole var
reference translates to a RowExpr containing child Var references, but
probably using varattno = 0 for children will suffice. As far as my
thoughts go, this will work in planner but when translating plan tree
into execution tree we will have to produce different Execution time
Var nodes. That should be fine since we will be dealing with only a
single plan. If we could fix that somehow, well and good.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2020-10-05 13:27:09 Re: Improve choose_custom_plan for initial partition prune case
Previous Message Tomas Vondra 2020-10-05 12:23:07 Re: [HACKERS] Custom compression methods