Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date: 2025-02-17 01:06:15
Message-ID: CAPpHfduZbgaTvsE4xhS3NNFviDfDQpC1OCiyAEYEti0OpC5spA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
> Thanks for taking a look!
>
> On 11/25/24 23:45, Heikki Linnakangas wrote:
> > On 24/09/2024 08:08, Andrei Lepikhov wrote:
> >> +  * proves the var is unique for this query. However, we'd better still
> >> +  * believe the null-fraction statistic.
> >>    */
> >>   if (vardata->isunique)
> >>       stadistinct = -1.0 * (1.0 - stanullfrac);
> >
> > I wonder about the "we'd better still believe the null-fraction
> > statistic" part. It makes sense for a unique index, but a DISTINCT or
> > GROUP BY collapses all the NULLs to a single row. So I think there's
> > some more work to be done here.
> IMO, in that particular case, it is not an issue: for GROUP BY, we set
> the vardata->isunique field and disallow recursing into the Var
> statistics inside the subquery - just as DISTINCT already does. So we
> have stanullfrac == 0, which means the optimiser doesn't count the
> number of NULLs. In the case of a UNIQUE index, the optimiser will have
> the stanullfrac statistic and will count NULLs.

This sounds convincing.
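
For the archives, a minimal sketch of how that negative stadistinct is
interpreted, assuming the usual pg_statistic convention (negative values
are a fraction of the row count, cf. get_variable_numdistinct() in
selfuncs.c):

/* stadistinct < 0 means "-stadistinct * ntuples distinct values" */
static double
ndistinct_from_stadistinct(double stadistinct, double ntuples)
{
    if (stadistinct > 0.0)
        return stadistinct;            /* absolute count of distinct values */
    return -stadistinct * ntuples;     /* fraction of the row count */
}

So with stadistinct = -1.0 * (1.0 - stanullfrac), stanullfrac = 0.1 and
ntuples = 1000 give 0.9 * 1000 = 900 distinct values: every non-NULL
value counts as unique, and the NULLs are excluded.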

> But your question raises another one. Could we add to a node some
> vardata_extra structure, which could track specific conditions, and let
> upper nodes take it into account alongside the Var statistic?
> For example, we could keep the 'unique set of columns' knowledge in
> such a structure for the Aggregate node. It could also be a solution to
> the problem of counting NULLs generated by the RHS of OUTER JOINs in
> the query tree.
> What's more, look at the query:
>
> CREATE TABLE gu_2 (x real);
> INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
> INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
> VACUUM ANALYZE gu_2;
>
> HashAggregate (cost=20.91..22.35 rows=144 width=4) (actual rows=50 loops=1)
>   Group Key: gu_2.x
>   Batches: 1  Memory Usage: 40kB
>   ->  HashAggregate (cost=19.11..20.55 rows=144 width=4) (actual rows=50 loops=1)
>         Group Key: gu_2.x
>         Batches: 1  Memory Usage: 40kB
>         ->  Seq Scan on gu_2 (cost=0.00..18.75 rows=145 width=4) (actual rows=149 loops=1)
>               Filter: ((x < '50'::double precision) OR (x IS NULL))
>               Rows Removed by Filter: 951
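
The query that produced this plan isn't quoted above; judging by the
plan shape (two stacked single-column aggregations over the same
filter), it was presumably something like the following reconstruction
(mine, not from the original mail):

EXPLAIN (ANALYZE)
SELECT x FROM (SELECT x FROM gu_2
               WHERE x < 50 OR x IS NULL
               GROUP BY x) AS sub
GROUP BY x;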
>
> Here we could also count the number of scanned NULLs separately in
> vardata_extra and use that in the upper GROUP BY estimation.

What would the type of vardata_extra be? And what information would it
store? It still seems too sketchy for me to understand.
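
To make the question concrete, here is a purely hypothetical shape such
a structure could take -- my guess only, nothing like this exists in the
patch:

/*
 * Hypothetical per-Var side channel, as sketched in the discussion
 * above: extra facts about a Var that the node producing it can prove,
 * for upper nodes to combine with the ordinary pg_statistic-based
 * vardata.
 */
typedef struct VarDataExtra
{
    bool    forced_unique;   /* a GROUP BY/DISTINCT below guarantees the
                              * Var's output values are unique */
    double  nulls_generated; /* NULLs introduced below the node, e.g. by
                              * the nullable side of an outer join */
    double  nulls_scanned;   /* NULLs that survived filters, for upper
                              * GROUP BY estimation */
} VarDataExtra;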

But I think for now we should go with the original patch. It seems to
be a quite straightforward extension of what 4767bc8ff2 does. I've
revised the commit message and applied pgindent to the sources. I'm
going to push this if there are no objections.

------
Regards,
Alexander Korotkov
Supabase

Attachment Content-Type Size
v2-0001-Improve-statistics-estimation-for-single-column-G.patch application/octet-stream 6.3 KB
