Quick Links

Re: using extended statistics to improve join estimates

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: using extended statistics to improve join estimates
Date:	2021-12-13 18:25:04
Message-ID:	14bedb1c-1feb-2613-96e4-2426deb5d34d@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 11/6/21 11:03, Andy Fan wrote:
> Hi Tomas:
>
> This is the exact patch I want, thanks for the patch!
>

Good to hear.

> On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
>> 3) estimation by join pairs
>>
>> At the moment, the estimates are calculated for pairs of relations, so
>> for example given a query
>>
>> explain analyze
>> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>> join t3 on (t1.b = t3.b and t2.c = t3.c);
>>
>> we'll estimate the first join (t1,t2) just fine, but then the second
>> join actually combines (t1,t2,t3). What the patch currently does is it
>> splits it into (t1,t2) and (t2,t3) and estimates those.
>
> Actually I can't understand how this works even for a simpler example.
> let's say we query like this (ONLY use t2's column to join t3).
>
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
> join t3 on (t2.c = t3.c and t2.d = t3.d);
>
> Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when comes
> to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel, so it
> is hard to
> work. Here I see your solution is splitting it into (t1, t2) AND (t2,
> t3) and estimate
> those. But how does this help to estimate the size of JoinRel(t1, t2, t3)?
>

Yeah, this is really confusing. The crucial thing to keep in mind is
this works with clauses before running setrefs.c, so the clauses
reference the original relations - *not* the join relation. Otherwise
even the regular estimation would not work, because where would it get
the per-column MCV lists etc.

Let's use a simple case with join clauses referencing just a single
attribute for each pair or relations. And let's talk about how many join
pairs it'll extract:

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but the join clause references just 2 rels, so
1 join pair (t1,t3)

Now a more complicated case, with more complex join clause

t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)
=> then we join t1/t2/t3, but this time the join clause references all
three rels, so we have two join pairs (t1,t3) and (t2,t3) and we can use
all the statistics.

>> I wonder if this
>> should actually combine all three MCVs at once - we're pretty much
>> combining the MCVs into one large MCV representing the join result.
>>
>
> I guess we can keep the MCVs on joinrel for these matches. Take the above
> query I provided for example, and suppose the MCV data as below:
>
> t1(a, b)
> (1, 2) -> 0.1
> (1, 3) -> 0.2
> (2, 3) -> 0.5
> (2, 8) -> 0.1
>
> t2(a, b)
> (1, 2) -> 0.2
> (1, 3) -> 0.1
> (2, 4) -> 0.2
> (2, 10) -> 0.1
>
> After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below
>
> (1, 2, 1, 2) -> 0.1 * 0.2
> (1, 3, 1, 3) -> 0.2 * 0.1
>
> And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
> (0.2 + 0.1 + 0.2 + 0.1)
>

Right, that's about the joint distribution I whole join.

> With this design, the nitems of MCV on joinrel would be less than
> either of baserel.
>

Actually, I think the number of items can grow, because the matches may
duplicate some items. For example in your example with (t1.a = t2.a) the
first first (1,2) item in t1 matches (1,2) and (1,3) in t2. And same for
(1,3) in t1. So that's 4 combinations. Of course, we could aggregate the
MCV by ignoring columns not used in the query.

> and since we handle the eqjoin as well, we even can record the items as
>
> (1, 2) -> 0.1 * 0.2
> (1, 3) -> 0.2 * 0.1;
>
> About when we should maintain the JoinRel's MCV data, rather than
> maintain this just
> after the JoinRel size is estimated, we can only estimate it when it
> is needed. for example:
>
> select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
> join t3 on (t2.c = t3.c and t2.d = t3.d);
>
> we don't need to maintain the MCV on (t1, t2, t3) since no others
> need it at all. However
> I don't check code too closely to see if it (Lazing computing MVC on
> joinrel) is convenient
> to do.
>

I'm not sure I understand what you're proposing here.

However, I think that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k for loops. Processing 2
relations at a time limits the amount of CPU we need. Of course, this
assumes the joins are independent, which may or may not be true.

2) It seems fairly easy to combine different types of statistics
(regular, extended, ...), and also consider the part not represented by
MCV. It seems much harder when joining more than 2 relations.

I'm also worried about amplification of errors - I suspect attempting to
build the joint MCV for the whole join relation may produce significant
estimation errors.

Furthermore, I think joins with clauses referencing more than just two
relations are fairly uncommon. And we can always improve the feature in
this direction in the future.

>
>> But I haven't done that yet, as it requires the MCVs to be combined
>> using the join clauses (overlap in a way), but I'm not sure how likely
>> that is in practice. In the example it could help, but that's a bit
>> artificial example.
>>
>>
>> 4) still just inner equi-joins
>>
>> I haven't done any work on extending this to outer joins etc. Adding
>> outer and semi joins should not be complicated, mostly copying and
>> tweaking what eqjoinsel() does.
>>
>
> Overall, thanks for the feature and I am expecting there are more cases
> to handle during discussion. To make the review process more efficient,
> I suggest that we split the patch into smaller ones and review/commit them
> separately if we have finalized the design roughly . For example:
>
> Patch 1 -- required both sides to have extended statistics.
> Patch 2 -- required one side to have extended statistics and the other side had
> per-column MCV.
> Patch 3 -- handle the case like WHERE t1.a = t2.a and t1.b = Const;
> Patch 3 -- handle the case for 3+ table joins.
> Patch 4 -- supports the outer join.
>
> I think we can do this if we are sure that each individual patch would work in
> some cases and would not make any other case worse. If you agree with this,
> I can do that splitting work during my review process.
>

I'll consider splitting it like this, but I'm not sure it makes the main
patch that much smaller.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: using extended statistics to improve join estimates at 2021-11-06 10:03:23 from Andy Fan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bossart, Nathan	2021-12-13 18:30:43	Re: O(n) tasks cause lengthy startups and checkpoints
Previous Message	Bossart, Nathan	2021-12-13 18:21:02	Re: O(n) tasks cause lengthy startups and checkpoints