Quick Links

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From:	Vlada Pogozhelskaya <v(dot)pogozhelskaya(at)postgrespro(dot)ru>
To:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date:	2025-02-18 15:07:09
Message-ID:	b6ad2bb9-94c8-417b-b795-28c1fa9f4793@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi all,

Following the discussion on improving statistics estimation by
considering GROUP BY as a unique constraint, I’ve prepared a patch that
integrates GROUP BY into cardinality estimation in a similar way to
DISTINCT.

This patch ensures that when a subquery contains a GROUP BY clause, the
optimizer recognizes the grouped columns as unique. The logic follows a
straightforward approach, comparing the GROUP BY columns with the target
list to determine uniqueness.

I’d appreciate any feedback or suggestions for further improvements.

---
regards,
Vlada Pogozhelskaya

On 17.02.2025 08:06, Alexander Korotkov wrote:
> On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov<lepihov(at)gmail(dot)com> wrote:
>> Thanks to take a look!
>>
>> On 11/25/24 23:45, Heikki Linnakangas wrote:
>>> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>>>> + * proves the var is unique for this query. However, we'd better
>>>> still
>>>> + * believe the null-fraction statistic.
>>>> */
>>>> if (vardata->isunique)
>>>> stadistinct = -1.0 * (1.0 - stanullfrac);
>>> I wonder about the "we'd better still believe the null-fraction
>>> statistic" part. It makes sense for a unique index, but a DISTINCT or
>>> GROUP BY collapses all the NULLs to a single row. So I think there's
>>> some more work to be done here.
>> IMO, in that particular case, it is not an issue: having GROUP-BY, we
>> set vardata->isunique field and disallowed to recurse into the Var
>> statistics inside subquery - likewise, DISTINCT already does. So, we
>> have stanullfrac == 0 - it means the optimiser doesn't count the number
>> of NULLs. In the case of the UNIQUE index, the optimiser will have the
>> stanullfrac statistic and count NULLs.
> This sounds convincing.
>
>> But your question raised one another. May we add to a node some
>> vardata_extra, which could count specific conditions, and let upper
>> nodes consider it using the Var statistic?
>> For example, we can separate the 'unique set of columns' knowledge in
>> such a structure for the Aggregate node. Also, it could be a solution to
>> problem of counting nulls, generated by RHS of OUTER JOINs in query tree.
>> What's more, look at the query:
>>
>> CREATE TABLE gu_2 (x real);
>> INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
>> INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
>> VACUUM ANALYZE gu_2;
>>
>> HashAggregate (cost=20.91..22.35 rows=144 width=4)
>> (actual rows=50 loops=1)
>> Group Key: gu_2.x
>> Batches: 1 Memory Usage: 40kB
>> -> HashAggregate (cost=19.11..20.55 rows=144 width=4)
>> (actual rows=50 loops=1)
>> Group Key: gu_2.x
>> Batches: 1 Memory Usage: 40kB
>> -> Seq Scan on gu_2 (cost=0.00..18.75 rows=145 width=4)
>> (actual rows=149 loops=1)
>> Filter: ((x < '50'::double precision) OR (x IS NULL))
>> Rows Removed by Filter: 951
>>
>> Here we also could count number of scanned NULLs separately in
>> vardata_extra and use it in upper GROUP-BY estimation.
> What could be the type of vardata_extra? And what information could
> it store? Yet seems too sketchy for me to understand.
>
> But, I think for now we should go with the original patch. It seems
> to be quite straightforward extension to what 4767bc8ff2 does. I've
> revised commit message and applied pg_indent to sources. I'm going to
> push this if no objections.
>
> ------
> Regards,
> Alexander Korotkov
> Supabase

Attachment	Content-Type	Size
v3-0001-Improve-statistics-estimation-for-GROUP-BY.patch	text/plain	2.0 KB

In response to

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser' at 2025-02-17 01:06:15 from Alexander Korotkov

Responses

Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser' at 2025-02-19 09:48:04 from Alexander Korotkov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2025-02-18 15:13:03	Re: Clarification on Role Access Rights to Table Indexes
Previous Message	Andrew Dunstan	2025-02-18 14:57:42	Re: New "single" COPY format