From: | Manuel Weitzman <manuelweitzman(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Jeremy Schneider <schnjere(at)amazon(dot)com>, "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Planning performance problem (67626.278ms) |
Date: | 2021-06-30 20:56:14 |
Message-ID: | E4AB7AC2-596B-4B60-A5D4-D4BD48F9DE55@gmail.com |
Lists: | pgsql-performance |
> 1. create_join_clause doesn't trouble to look for commuted
> equivalents, which perhaps is penny-wise and pound-foolish.
> The cost of re-deriving selectivity estimates could be way
> more than the cost of checking this.
Agreed.
> 2. Although these look like they ought to be equivalent to the
> original clauses (modulo commutation, for some of them), they don't
> look that way to create_join_clause, because it's also checking
> for parent_ec equality. Per the comment,
>
> * parent_ec is either equal to ec (if the clause is a potentially-redundant
> * join clause) or NULL (if not). We have to treat this as part of the
> * match requirements --- it's possible that a clause comparing the same two
> * EMs is a join clause in one join path and a restriction clause in another.
>
> It might be worth digging into the git history to see why that
> became a thing and then considering whether there's a way around it.
> (I'm pretty sure that comment is mine, but I don't recall the details
> anymore.)
That sounds OK to me. I cannot prove that they're equivalent to the
original clauses, so I think it is fine to assume they're not (I'm not
an expert here, quite the opposite).
> Anyway, it's certainly not the case that we're making new
> RestrictInfos for every pair of rels. It looks that way in this
> example because the join vars all belong to the same EC, but
> that typically wouldn't be the case in more complex queries.
Good to know, this wasn't clear to me.
> So we could look into whether this code can be improved to share
> RestrictInfos across more cases. Another thought is that even
> if we need to keep original and derived clauses separate, maybe it'd
> be all right to copy previously-determined cached selectivity values
> from an original clause to an otherwise-identical derived clause
> (cf. commute_restrictinfo()). I'm not sure though whether it's
> reliably the case that we'd have filled in selectivities for the
> original clauses before this code wants to clone them.
To be honest, even if that sounds like a good idea to dig on, I think
it wouldn't completely solve the problem with repeated calls to
get_actual_variable_range().
The example query I gave does a lot of simple self-joins, which keeps
the thought process simple, but I worry more about the more common
case in which more than two distinct tables are involved in the
query.
For example, instead of having "b1, b2, ..., bn" as aliases of "b" in
this query
>> explain (analyze, buffers)
>> select * from a
>> join b b1 on (b1.a = a.a)
>> join b b2 on (b2.a = a.a)
>> where b1.a in (1,100,10000,1000000,1000001);
it is also possible to reproduce the increasing planning cost (in
buffers) for each new join on a distinct table being added:
explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
join c on (c.a = a.a)
-- ... (etc)
where c.a in (1,100,10000,1000000,1000001);
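For anyone trying to reproduce this, a minimal setup along these lines should work; this schema is my assumption based on the column references above, not the exact one from earlier in the thread:

```sql
-- Hypothetical schema matching the aliases used in the queries above.
create table a (a int primary key);
create table b (a int);
create table c (a int);
create index on b (a);
create index on c (a);
insert into a select generate_series(1, 1000001);
insert into b select generate_series(1, 1000001);
insert into c select generate_series(1, 1000001);
analyze a, b, c;
```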
I can imagine that deconstruct_jointree() and
generate_join_implied_equalities() would generate multiple
RestrictInfos, many of which would involve a constraint on a.a (each
involving a different table):
b.a = a.a
c.a = a.a
c.a = b.a
a.a = b.a
a.a = c.a
... (etc)
(if we wanted, we could also add a different WHERE clause on each of
the tables involved to make really sure all RestrictInfos are
different).
For each of these RestrictInfos there *could* be one cache miss in
cached_scansel() that *could* force the planner to compute
get_actual_variable_range() for the same variable (a.a) over and over,
as mergejoinscansel() always computes the selectivity for the
intervals that require actual extremal values. In practice, this
recomputation of the variable range seems to happen a lot.
One way I can see to share this kind of information (the extremal
values) across RestrictInfos is to store the known variable ranges in
PlannerInfo (or within one of its members), which seems to be
available everywhere it would be needed.
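To make the idea concrete, here is a minimal sketch in Python (not PostgreSQL code; names like PlannerInfo and get_actual_variable_range only mimic the concepts discussed above): instead of caching scan selectivity per RestrictInfo, the computed extremal values are cached per underlying variable, so any number of derived or commuted clauses over the same variable triggers at most one expensive range lookup.

```python
# Illustrative sketch only: caching extremal values per variable in a
# planner-global structure, rather than per clause.

class PlannerInfo:
    def __init__(self):
        # variable -> (min, max), shared across all RestrictInfos
        self.var_range_cache = {}

range_computations = 0

def get_actual_variable_range(var, data):
    """Stand-in for the expensive index probe for extremal values."""
    global range_computations
    range_computations += 1
    vals = data[var]
    return (min(vals), max(vals))

def cached_variable_range(root, var, data):
    # One computation per variable, no matter how many clauses use it.
    if var not in root.var_range_cache:
        root.var_range_cache[var] = get_actual_variable_range(var, data)
    return root.var_range_cache[var]

data = {"a.a": [1, 100, 10000, 1000000]}
root = PlannerInfo()

# Many equivalent join clauses all reference the same variable a.a;
# each would be a distinct RestrictInfo, but they share one range.
clauses = ["b.a = a.a", "c.a = a.a", "c.a = b.a", "a.a = b.a", "a.a = c.a"]
ranges = [cached_variable_range(root, "a.a", data) for _ in clauses]

print(range_computations)  # 1 instead of one probe per clause
```

The point of keying the cache on the variable (rather than the clause) is exactly that commuted and derived clauses stop mattering: they all resolve to the same entry.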
Best regards,
Manuel