From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: Consider the number of columns in the sort cost model
Date: 2024-10-31 14:00:09
Message-ID: cbdf6cd0-9fce-49e8-9b31-ef031c18c082@postgrespro.ru
Lists: pgsql-hackers
On 29.10.2024 05:47, Andrei Lepikhov wrote:
> On 10/28/24 16:48, Alena Rybakina wrote:
>> On 23.10.2024 04:39, Andrei Lepikhov wrote:
>>> On 15/10/2024 12:15, David Rowley wrote:
>>> And the last patch is a demo of how I'm going to use the previous
>>> three patches and add one more strategy to improve the order of
>>> columns in the GROUP-BY clause.
>>>
>> To be honest, I didn’t find information about this in the code, but
>> did I understand everything correctly?
> Yes
Thanks for the confirmation.
>>
>> 2. I noticed that statistics of distinct values are calculated
>> several times. Maybe I miss something, but this can be avoided if
>> these statistics can be saved after calculation. For example, I saw
>> that it is possible that you calculate the same statistic information
>> for the same equivalence members in cost_incremental_sort and
>> identify_sort_ecmember. Is it possible to store information in a
>> structure and use it later?
> Hmm, I don't see multiple calculations. em_distinct has made
> specifically for this sake. Can you provide specific case and code lines?
When you pointed it out, I saw it. You are right: the check for zero is
what prevents the useless recalculation.
>>
>> 3. I think you should initialize the variable ndist in your patch. I
>> faced the compile complaining during your code compilation.
>>
>> costsize.c: In function ‘identify_sort_ecmember’:
>> costsize.c:6694:42: error: ‘ndist’ may be used uninitialized
>> [-Werror=maybe-uninitialized]
>> 6694 | em->em_ndistinct = ndist;
>> | ~~~~~~~~~~~~~~~~~^~~~~
>> costsize.c:6680:33: note: ‘ndist’ was declared here
>> 6680 | double ndist;
>> | ^~~
>> cc1: all warnings being treated as errors
>> gmake[4]: *** [<builtin>: costsize.o] Error 1
> I think you can just update your compiler. But I added the ndist
> initialisation to make more compilers happy :).
Thank you :)
>
>> + Assert (node != NULL);
>> +
>> examine_variable(root, node, 0, &vardata);
>> if (!HeapTupleIsValid(vardata.statsTuple))
>> continue;
> I don't think so. At least until you provide the case when the
> get_sortgroupclause_expr function returns NULL.
> That's more, remember - the patch 0004 here is just to show the
> perspective and still under construction.
> Anyway, thanks, I found out that the patch set doesn't apply correctly
> because of 828e94c. So, see the new version in the attachment.
Yes, you are right: if this case is possible, it is connected with an
error in the formation of the target list.

I played around with the examples a bit and couldn't figure something
out. When I put the duplicate values into different columns (first into
one column, then into the other), the order of the columns for the sort
operation doesn't change. Isn't this a mistake?
create table a (x1 int, y1 int);
create table b (x2 int, y2 int);
insert into a values (NULL, NULL);
insert into a values (NULL, 1);
insert into a values (1, 1);
insert into a values (1, NULL);
create index a_x1_idx on a(x1);
create index b_x2_idx on b(x2);
create index a_y1_idx on a(y1);
create index b_y2_idx on b(y2);
insert into b select 1, 2 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 1, 3 from generate_series(3,30) as id;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1)
   Hash Cond: (b.x2 = a.x1)
   ->  HashAggregate  (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1)
         Group Key: b.x2, b.y2
         Batches: 1  Memory Usage: 40kB
         ->  Seq Scan on b  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on a  (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1)
 Planning Time: 1.649 ms
 Execution Time: 0.485 ms
(11 rows)
delete from b;
insert into b select 2, 1 from generate_series(11,20) as id;
insert into b select 1, 1 from generate_series(1,10) as id;
insert into b select 3, 1 from generate_series(3,30) as id;
vacuum analyze;
explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1)
   Hash Cond: (a.x1 = b.x2)
   ->  Seq Scan on a  (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.011 rows=4 loops=1)
   ->  Hash  (cost=1.75..1.75 rows=3 width=8) (actual time=0.067..0.068 rows=3 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  HashAggregate  (cost=1.72..1.75 rows=3 width=8) (actual time=0.063..0.064 rows=3 loops=1)
               Group Key: b.x2, b.y2
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on b  (cost=0.00..1.48 rows=48 width=8) (actual time=0.006..0.014 rows=48 loops=1)
 Planning Time: 0.391 ms
 Execution Time: 0.151 ms
(11 rows)
I noticed a spelling mistake: "significance phusics" (presumably
"physics"); "incapsulates" also looks like a typo for "encapsulates":

/*
 * Calculate multiplier reflecting the number of comparisons which executor
 * have to perform during the sort with this specific order of columns.
 *
 * The comparison factor f = 1.+F(pathkeys). There 1. incapsulates the
 * second-order of *significance phusics* which cost function doesn't
 * consider.
--
Regards,
Alena Rybakina
Postgres Professional