From: | Vlada Pogozhelskaya <v(dot)pogozhelskaya(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser' |
Date: | 2025-02-18 15:07:09 |
Message-ID: | b6ad2bb9-94c8-417b-b795-28c1fa9f4793@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi all,
Following the discussion on improving statistics estimation by
considering GROUP BY as a unique constraint, I’ve prepared a patch that
integrates GROUP BY into cardinality estimation in a similar way to
DISTINCT.
This patch ensures that when a subquery contains a GROUP BY clause, the
optimizer recognizes the grouped columns as unique. The logic follows a
straightforward approach, comparing the GROUP BY columns with the target
list to determine uniqueness.
I’d appreciate any feedback or suggestions for further improvements.
---
regards,
Vlada Pogozhelskaya
On 17.02.2025 08:06, Alexander Korotkov wrote:
> On Thu, Nov 28, 2024 at 4:39 AM Andrei Lepikhov<lepihov(at)gmail(dot)com> wrote:
>> Thanks to take a look!
>>
>> On 11/25/24 23:45, Heikki Linnakangas wrote:
>>> On 24/09/2024 08:08, Andrei Lepikhov wrote:
>>>> + * proves the var is unique for this query. However, we'd better
>>>> still
>>>> + * believe the null-fraction statistic.
>>>> */
>>>> if (vardata->isunique)
>>>> stadistinct = -1.0 * (1.0 - stanullfrac);
>>> I wonder about the "we'd better still believe the null-fraction
>>> statistic" part. It makes sense for a unique index, but a DISTINCT or
>>> GROUP BY collapses all the NULLs to a single row. So I think there's
>>> some more work to be done here.
>> IMO, in that particular case, it is not an issue: having GROUP-BY, we
>> set vardata->isunique field and disallowed to recurse into the Var
>> statistics inside subquery - likewise, DISTINCT already does. So, we
>> have stanullfrac == 0 - it means the optimiser doesn't count the number
>> of NULLs. In the case of the UNIQUE index, the optimiser will have the
>> stanullfrac statistic and count NULLs.
> This sounds convincing.
>
>> But your question raised one another. May we add to a node some
>> vardata_extra, which could count specific conditions, and let upper
>> nodes consider it using the Var statistic?
>> For example, we can separate the 'unique set of columns' knowledge in
>> such a structure for the Aggregate node. Also, it could be a solution to
>> problem of counting nulls, generated by RHS of OUTER JOINs in query tree.
>> What's more, look at the query:
>>
>> CREATE TABLE gu_2 (x real);
>> INSERT INTO gu_2 (x) SELECT gs FROM generate_series(1,1000) AS gs;
>> INSERT INTO gu_2 (x) SELECT NULL FROM generate_series(1,100) AS gs;
>> VACUUM ANALYZE gu_2;
>>
>> HashAggregate (cost=20.91..22.35 rows=144 width=4)
>> (actual rows=50 loops=1)
>> Group Key: gu_2.x
>> Batches: 1 Memory Usage: 40kB
>> -> HashAggregate (cost=19.11..20.55 rows=144 width=4)
>> (actual rows=50 loops=1)
>> Group Key: gu_2.x
>> Batches: 1 Memory Usage: 40kB
>> -> Seq Scan on gu_2 (cost=0.00..18.75 rows=145 width=4)
>> (actual rows=149 loops=1)
>> Filter: ((x < '50'::double precision) OR (x IS NULL))
>> Rows Removed by Filter: 951
>>
>> Here we also could count number of scanned NULLs separately in
>> vardata_extra and use it in upper GROUP-BY estimation.
> What could be the type of vardata_extra? And what information could
> it store? Yet seems too sketchy for me to understand.
>
> But, I think for now we should go with the original patch. It seems
> to be quite straightforward extension to what 4767bc8ff2 does. I've
> revised commit message and applied pg_indent to sources. I'm going to
> push this if no objections.
>
> ------
> Regards,
> Alexander Korotkov
> Supabase
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Improve-statistics-estimation-for-GROUP-BY.patch | text/plain | 2.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2025-02-18 15:13:03 | Re: Clarification on Role Access Rights to Table Indexes |
Previous Message | Andrew Dunstan | 2025-02-18 14:57:42 | Re: New "single" COPY format |