Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2021-08-11 15:13:34
Message-ID: 6ef82d17-2d37-21ff-b00e-320927851b0e@enterprisedb.com
Lists: pgsql-hackers

On 8/11/21 2:08 PM, Dean Rasheed wrote:
> On Wed, 11 Aug 2021 at 00:05, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> So with the statistics, the estimate gets a bit worse. The reason is
>> fairly simple - if you look at the two parts of the OR clause, we get this:
>>
>> clause                 actual    no stats    with stats
>> --------------------------------------------------------
>> (A < B and A <> A)          0      331667             1
>> not (A < A)           1000000      333333        333333
>>
>> This shows that the first clause is clearly improved, while the
>> (A < A) clause is estimated the same way as before: it has a single
>> Var, so it's considered "simple", and we ignore the MCV selectivity
>> and just use the simple_sel calculated by clause_selectivity_ext.
>>
>> And the 333333 and 331667 just happen to be closer to the actual row
>> count. But that's mostly by luck, clearly.
>>
>> But now that I think about it, maybe the problem really is in how
>> statext_mcv_clauselist_selectivity treats this clause - the
>> definition of "simple" clauses as "has one attnum" was appropriate
>> when only (Var op Const) clauses were supported. But with
>> (Var op Var) that's probably not correct anymore.
>>
>
> Hmm, interesting. Clearly the fact that the combined estimate without
> extended stats was better was just luck, based on its large
> overestimate of the first clause. But it's also true that a (Var op
> Var) clause should not be treated as simple, because "simple" in this
> context is meant to be for clauses that are likely to be better
> estimated with regular stats, whereas in this case, extended stats
> would almost certainly do better on the second clause.

I don't see why extended stats would do better on the second clause. I
mean, if you have (A < A) then extended stats pretty much "collapse"
into per-column stats. We could get almost the same estimate from a
single-column MCV list, etc. The reason that does not happen is that we
just treat it as a range clause and assign it the default 33% estimate.

But we could make that a bit smarter, and assign better estimates to
those clauses:

(A < A) => 0.0
(A = A) => 1.0
(A <= A) => 1.0

And that'd give us the same estimates, I think. Not sure that's worth
it, because (A op A) clauses are probably very rare; OTOH, it's cheap.
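
To make that concrete, here is a minimal sketch of the idea, assuming a
hypothetical helper same_var_selectivity() that gets called once we've
established both operands are the same Var. The btree strategy numbers
and DEFAULT_INEQ_SEL are the real constants; everything else is
illustrative, not existing code:

#include "postgres.h"

#include "access/stratnum.h"
#include "utils/selfuncs.h"

/*
 * Hypothetical helper: once both operands of the operator are known to
 * be the same Var, the operator's btree strategy fully determines the
 * selectivity (ignoring NULLs, which make the "always true" cases
 * slightly optimistic).
 */
static double
same_var_selectivity(StrategyNumber strategy)
{
	switch (strategy)
	{
		case BTLessStrategyNumber:			/* (A < A) is never true */
		case BTGreaterStrategyNumber:		/* (A > A) likewise */
			return 0.0;

		case BTEqualStrategyNumber:			/* (A = A) is always true */
		case BTLessEqualStrategyNumber:		/* (A <= A) likewise */
		case BTGreaterEqualStrategyNumber:	/* (A >= A) likewise */
			return 1.0;

		default:
			return DEFAULT_INEQ_SEL;		/* keep the 33% default */
	}
}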

>
> Perhaps the easiest way to identify simple clauses would be in
> statext_is_compatible_clause(), rather than the way it's done now,
> because it has the relevant information at hand, so it could be made
> to return an extra flag.
>

Agreed, that seems like a better place to fix this.
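
For illustration, a minimal sketch of that classification, assuming the
extra flag would be set from a check along these lines inside
statext_is_compatible_clause(). The helper name and the exact tests are
mine, not existing code (a real version would also need to strip
RelabelType nodes, handle ScalarArrayOpExpr, etc.):

#include "postgres.h"

#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/*
 * Hypothetical classification: a (Var op Const) clause is "simple"
 * (regular per-column stats are likely to estimate it well), while a
 * (Var op Var) clause is not, even though it may reference a single
 * attnum.
 */
static bool
clause_is_simple(OpExpr *expr)
{
	Node	   *left;
	Node	   *right;

	if (list_length(expr->args) != 2)
		return false;

	left = (Node *) linitial(expr->args);
	right = (Node *) lsecond(expr->args);

	/* (Var op Const), in either order, counts as simple ... */
	if ((IsA(left, Var) && IsA(right, Const)) ||
		(IsA(left, Const) && IsA(right, Var)))
		return true;

	/* ... but (Var op Var) does not, even with a single attnum */
	return false;
}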

> This feels like rather an artificial example though. Is there any real
> use for this sort of clause?
>

True. It seems a bit artificial, which is understandable as it came from
a synthetic test generating all possible clauses. OTOH, fixing it seems
fairly cheap ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
