Re: Why don't we consider explicit Incremental Sort?

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why don't we consider explicit Incremental Sort?
Date: 2024-09-14 03:50:30
Message-ID: CAMbWs49+CXsrgbq0LD1at-5jR=AHHN0YtDy9YvgXAsMfndZe-w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 13, 2024 at 7:51 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Sure, an incremental sort can improve things if things go well, no doubt
> about that. But how significant can the improvement be, especially if we
> need to sort everything? In your example it's ~15%, which is nice, and
> maybe the benefits can be much larger if the incremental sort can do
> everything in memory, without flushing to disk.
>
> But what about the risks? What will happen if the groups are not this
> uniformly and independently sized, and stuff like that? Maybe it'll be
> costed well enough, I haven't tried.

I understand the concern and agree that there is a risk of regression
if there is a skew in the number of rows per pre-sorted group.

Actually, there were discussions about this during the work on commit
4a29eabd1; please see [1].  I will repeat David's demonstration and
rerun his query on the current master branch to see what happens.

create table ab (a int not null, b int not null);

insert into ab
  select 0, x from generate_series(1, 999000) x
  union all
  select x % 1000 + 1, 0 from generate_series(999001, 1000000) x;

create index on ab (a);

analyze ab;

So this is a table with a very large skew: the 0 group has 999000
rows, and the remaining groups 1-1000 have just 1 row each.
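
As a quick sanity check of that distribution (a verification query I'm
adding here, not part of David's original script):

select a, count(*) as rows_per_group
  from ab
 group by a
 order by rows_per_group desc
 limit 3;

This should show group 0 with 999000 rows and every other group with a
single row.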

-- on master
explain (analyze, timing off) select * from ab order by a,b;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=2767.38..109344.55 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 33  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
   Pre-sorted Groups: 1  Sort Method: external merge  Average Disk: 17680kB  Peak Disk: 17680kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.829 ms
 Execution Time: 1028.892 ms
(8 rows)

-- manually disable incremental sort
set enable_incremental_sort to off;
explain (analyze, timing off) select * from ab order by a,b;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Sort  (cost=127757.34..130257.34 rows=1000000 width=8) (actual rows=1000000 loops=1)
   Sort Key: a, b
   Sort Method: external merge  Disk: 17704kB
   ->  Seq Scan on ab  (cost=0.00..14425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.814 ms
 Execution Time: 765.198 ms
(6 rows)

Look, the regression still happens on current master!

This is a question I've often pondered: each time we introduce a new
optimization, there are always potential cases where it could lead to
regressions.  What should we do about this?  I tend to agree with
David's position, as stated in the commit message of 4a29eabd1:

"
That, of course, as with teaching the planner any new tricks,
means an increased likelihood that the planner will perform an incremental
sort when it's not the best method. Our standard escape hatch for these
cases is an enable_* GUC.
"

I know ideally we should not rely on these enable_* GUCs, but in
reality it seems that sometimes we do not have a better solution.
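
For what it's worth, here is a sketch (mine, not from this thread) of
how that escape hatch can be scoped to a single transaction with SET
LOCAL, so the setting does not leak into the rest of the session:

begin;
set local enable_incremental_sort to off;
explain (analyze, timing off) select * from ab order by a, b;
commit;  -- enable_incremental_sort reverts to its previous value here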

[1] https://postgr.es/m/CAApHDvr1Sm+g9hbv4REOVuvQKeDWXcKUAhmbK5K+dfun0s9CvA@mail.gmail.com

Thanks
Richard
