Quick Links

Re: Consider the number of columns in the sort cost model

From:	Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To:	Andrei Lepikhov <lepihov(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Kirill Reshke <reshkekirill(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Consider the number of columns in the sort cost model
Date:	2024-10-28 09:48:19
Message-ID:	0870a8c4-4579-4428-bacb-c604463ab2ae@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi! Thank you for your work on this subject!

On 23.10.2024 04:39, Andrei Lepikhov wrote:
> On 15/10/2024 12:15, David Rowley wrote:
>> On Tue, 15 Oct 2024 at 17:48, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
>> I think maybe what is worth working on is seeing if you can better
>> estimate the number of tiebreak comparisons by checking the number of
>> distinct values for each sort key. That might require what we
>> discussed on another thread about having estimate_num_groups() better
>> use the n_distinct estimate from the EquivalenceMember with the least
>> distinct values. It'll be another question if all that can be done
>> and this all still perform well enough for cost_sort(). You may have
>> to write it first before we can figure that out. It may be possible
>> to buy back some of the overheads with some additional caching...
>> Perhaps that could be done within the EquivalenceClass.
> It seems that your idea with an equivalence class works.
> In the patchset attached, I show all the ideas (I initially wanted to
> discuss it step-by-step).
> In the attached, the first patch is about choosing adequate expression
> from the list of EC members.
> The second one is the current patch, which considers the number of
> sort columns.
> Third patch - elaboration of the second patch. Use only the trustful
> statistics here - the number of ndistinct values on the first sorted
> column.
> And the last patch is a demo of how I'm going to use the previous
> three patches and add one more strategy to improve the order of
> columns in the GROUP-BY clause.
>
I have some question about your patches.

1. You consider two lists root->group_pathkeys and
root->processed_groupClause.

As far as I understand, this is due to the fact that the pathkey
elements are identical to the group clause and identity verification is
precisely checked by this condition:

if (foreach_current_index(lc1) >= root->num_groupby_pathkeys)

To be honest, I didn’t find information about this in the code, but did
I understand everything correctly?

2. I noticed that statistics of distinct values are calculated several
times. Maybe I miss something, but this can be avoided if these
statistics can be saved after calculation. For example, I saw that it is
possible that you calculate the same statistic information for the same
equivalence members in cost_incremental_sort and identify_sort_ecmember.
Is it possible to store information in a structure and use it later?

3. I think you should initialize the variable ndist in your patch. I
faced the compile complaining during your code compilation.

costsize.c: In function ‘identify_sort_ecmember’:
costsize.c:6694:42: error: ‘ndist’ may be used uninitialized
[-Werror=maybe-uninitialized]
6694 |                         em->em_ndistinct = ndist;
      |                         ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
6680 |                         double ndist;
      |                                 ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1

4. Before executing examine_variable function I think you should check
that it is not null.

@@ -575,6 +575,8 @@ get_useful_group_keys_orderings(PlannerInfo *root,
Path *path)
*/
continue;

+       Assert (node != NULL);
+
        examine_variable(root, node, 0, &vardata);
        if (!HeapTupleIsValid(vardata.statsTuple))
            continue;

--
Regards,
Alena Rybakina
Postgres Professional

In response to

Re: Consider the number of columns in the sort cost model at 2024-10-23 01:39:38 from Andrei Lepikhov

Responses

Re: Consider the number of columns in the sort cost model at 2024-10-29 02:47:27 from Andrei Lepikhov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Joel Jacobson	2024-10-28 10:20:57	Re: New "raw" COPY format
Previous Message	Peter Eisentraut	2024-10-28 09:45:21	Re: further #include cleanup (IWYU)