From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date: 2024-11-25 16:45:48
Message-ID: 61d7cb95-9194-42cf-8225-351703de9cb8@iki.fi
Lists: pgsql-hackers
On 24/09/2024 08:08, Andrei Lepikhov wrote:
> On 19/9/2024 09:55, Andrei Lepikhov wrote:
>> This wrong prediction makes things much worse if the query has more
>> upper query blocks.
>> His question was: Why not consider the grouping column unique in the
>> upper query block? It could improve estimations.
>> After a thorough investigation, I discovered that in commit
>> 4767bc8ff2 most of the work was already done for DISTINCT clauses. So,
>> why not do the same for grouping? A sketch of the patch is attached.
>> As I see it, grouping in this sense works quite similarly to DISTINCT,
>> and we have no reason to ignore it. After applying the patch, you can
>> see that the prediction has improved:
>>
>> Hash Right Join (cost=5.62..162.56 rows=50 width=36)
>>
> A regression test is added in the new version.
> The code looks tiny, simple and non-invasive - it will be easy to commit
> or reject. So I have added it to the next commitfest.
Looks good at a quick glance.
> @@ -5843,11 +5852,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
> }
>
> /*
> - * If there is a unique index or DISTINCT clause for the variable, assume
> - * it is unique no matter what pg_statistic says; the statistics could be
> - * out of date, or we might have found a partial unique index that proves
> - * the var is unique for this query. However, we'd better still believe
> - * the null-fraction statistic.
> + * If there is a unique index, DISTINCT or GROUP-BY clause for the variable,
> + * assume it is unique no matter what pg_statistic says; the statistics
> + * could be out of date, or we might have found a partial unique index that
> + * proves the var is unique for this query. However, we'd better still
> + * believe the null-fraction statistic.
> */
> if (vardata->isunique)
> stadistinct = -1.0 * (1.0 - stanullfrac);
I wonder about the "we'd better still believe the null-fraction
statistic" part. It makes sense for a unique index, which can still
contain any number of NULLs, but a DISTINCT or GROUP BY collapses all
the NULLs into a single row, so every row of its output is distinct,
NULLs included. So I think there's some more work to be done here.
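
A sketch of what I mean, just to illustrate (the unique_from_grouping
flag is invented here, this is not a concrete proposal):

    /*
     * A unique index still allows any number of NULL rows, so discount
     * them; but the output of a DISTINCT or GROUP BY has collapsed all
     * NULLs into a single row, so every output row is distinct.
     */
    if (vardata->isunique)
    {
        if (vardata->unique_from_grouping)  /* hypothetical flag */
            stadistinct = -1.0;
        else
            stadistinct = -1.0 * (1.0 - stanullfrac);
    }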
--
Heikki Linnakangas
Neon (https://neon.tech)