From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Use extended statistics to estimate (Var op Var) clauses |
Date: | 2021-08-11 15:13:34 |
Message-ID: | 6ef82d17-2d37-21ff-b00e-320927851b0e@enterprisedb.com |
Lists: | pgsql-hackers |
On 8/11/21 2:08 PM, Dean Rasheed wrote:
> On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> So with the statistics, the estimate gets a bit worse. The reason is
>> fairly simple - if you look at the two parts of the OR clause, we get this:
>>
>> clause actual no stats with stats
>> ---------------------------------------------------------------
>> (A < B and A <> A) 0 331667 1
>> not (A < A) 1000000 333333 333333
>>
>> This shows that the estimate of the first clause is clearly improved,
>> while (A < A) is estimated the same way as before, because the clause
>> references a single Var, so it's considered "simple" and we ignore the
>> MCV selectivity, using just the simple_sel calculated by
>> clause_selectivity_ext.
>>
>> And the 333333 and 331667 just happen to be closer to the actual row
>> count. But that's mostly by luck, clearly.
>>
>> But now that I think about it, maybe the problem really is in how
>> statext_mcv_clauselist_selectivity treats this clause - the definition
>> of "simple" clauses as "has one attnum" was appropriate when only
>> clauses (Var op Const) were supported. But with (Var op Var) that's
>> probably not correct anymore.
>>
>
> Hmm, interesting. Clearly the fact that the combined estimate without
> extended stats was better was just luck, based on its large
> overestimate of the first clause. But it's also true that a (Var op
> Var) clause should not be treated as simple, because "simple" in this
> context is meant to be for clauses that are likely to be better
> estimated with regular stats, whereas in this case, extended stats
> would almost certainly do better on the second clause.
I don't see why extended stats would do better on the second clause. I
mean, if you have (A < A), then extended stats pretty much "collapse"
into per-column stats. We could get almost the same estimate from a
single-column MCV list, etc. The reason that does not happen is that we
just treat it as a range clause and assign it the default 33% estimate.
But we could make that a bit smarter, and assign better estimates to
those clauses:
(A < A) => 0.0
(A = A) => 1.0
(A <= A) => 1.0
And that'd give us the same estimates, I think. Not sure that's worth
it, because (A op A) clauses are probably very rare, OTOH it's cheap.
>
> Perhaps the easiest way to identify simple clauses would be in
> statext_is_compatible_clause(), rather than the way it's done now,
> because it has the relevant information at hand, so it could be made
> to return an extra flag.
>
Agreed, that seems like a better place to fix this.
> This feels like rather an artificial example though. Is there any real
> use for this sort of clause?
>
True. It seems a bit artificial, which is understandable as it came from
a synthetic test generating all possible clauses. OTOH, fixing it seems
fairly cheap ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company