Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date: 2024-11-28 02:39:18
Message-ID: 41c5c820-bf91-4070-8261-fb87e91c5913@gmail.com
Lists: pgsql-hackers

Thanks for taking a look!

On 11/25/24 23:45, Heikki Linnakangas wrote:
> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>> +     * proves the var is unique for this query.  However, we'd better still
>> +     * believe the null-fraction statistic.
>>       */
>>      if (vardata->isunique)
>>          stadistinct = -1.0 * (1.0 - stanullfrac);
>
> I wonder about the "we'd better still believe the null-fraction
> statistic" part. It makes sense for a unique index, but a DISTINCT or
> GROUP BY collapses all the NULLs to a single row. So I think there's
> some more work to be done here.
IMO, in that particular case it is not an issue: having a GROUP BY, we
set the vardata->isunique field and disallow recursing into the Var's
statistics inside the subquery - just as DISTINCT already does. So we
end up with stanullfrac == 0, which means the optimiser doesn't account
for NULLs at all. In the case of a UNIQUE index, the optimiser will have
the stanullfrac statistic and will count NULLs.
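
For reference, the existing DISTINCT handling in
examine_simple_variable() (selfuncs.c) has roughly this shape, and the
idea is to treat a single-column GROUP BY analogously (the snippet below
paraphrases the current sources, not the patch itself):

/*
 * Roughly the existing DISTINCT handling in examine_simple_variable();
 * the same idea can be applied to a single-column GROUP BY.
 * Paraphrased from selfuncs.c, not the patch text.
 */
if (subquery->distinctClause)
{
    if (list_length(subquery->distinctClause) == 1 &&
        targetIsInSortList(ste, InvalidOid, subquery->distinctClause))
        vardata->isunique = true;   /* unique in the subquery output */
    /* cannot recurse into the subquery's column statistics */
    return;
}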

But your question raises another one. May we add to a node some
vardata_extra, which could track node-specific conditions, and let upper
nodes take it into account alongside the Var statistics?
For example, we could keep the 'unique set of columns' knowledge in such
a structure for the Aggregate node. It could also be a solution to the
problem of counting NULLs generated by the RHS of OUTER JOINs in the
query tree (see the hypothetical sketch below).
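
Just to make that concrete, such a structure might look something like
this (purely hypothetical sketch - none of these names exist in the tree
today):

/* Hypothetical per-node "vardata_extra"; illustration only. */
typedef struct VarDataExtra
{
    Bitmapset  *unique_cols;    /* columns proven unique by this node,
                                 * e.g. the GROUP BY / DISTINCT keys */
    double      null_rows;      /* NULL rows known to be produced below,
                                 * e.g. by a filter or an outer join's RHS */
} VarDataExtra;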
What's more, consider the following example:

CREATE TABLE gu_2 (x real);
INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
VACUUM ANALYZE gu_2;
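
A query of roughly this shape - two levels of grouping over the same
column, with a filter that lets NULLs through - produces the plan below
(the statement here is reconstructed for illustration; the exact
original may differ):

-- Illustrative reconstruction; the exact original statement may differ.
EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT x
FROM (SELECT x FROM gu_2 WHERE x < 50 OR x IS NULL GROUP BY x) AS q
GROUP BY x;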

HashAggregate  (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
  Group Key: gu_2.x
  Batches: 1  Memory Usage: 40kB
  ->  HashAggregate  (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
        Group Key: gu_2.x
        Batches: 1  Memory Usage: 40kB
        ->  Seq Scan on gu_2  (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
              Filter: ((x < '50'::double precision) OR (x IS NULL))
              Rows Removed by Filter: 951

Here we could also count the number of scanned NULLs separately in
vardata_extra and use it in the upper GROUP-BY estimation.

--
regards, Andrei Lepikhov
