Re: Why don't we consider explicit Incremental Sort?

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why don't we consider explicit Incremental Sort?
Date: 2024-09-14 21:50:09
Message-ID: 42b31c39-0a6b-4bfd-a62e-86f6070c3125@vondra.me
Lists: pgsql-hackers

On 9/14/24 05:50, Richard Guo wrote:
> On Fri, Sep 13, 2024 at 7:51 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> Sure, an incremental sort can improve things if things go well, no doubt
>> about that. But how significant can the improvement be, especially if we
>> need to sort everything? In your example it's ~15%, which is nice, and
>> maybe the benefits can be much larger if the incremental sort can do
>> everything in memory, without flushing to disk.
>>
>> But what about the risks? What happens if the groups are not sized
>> this uniformly and independently, and so on? Maybe it'll be costed
>> well enough; I haven't tried.
>
> I understand the concern and agree that there is a risk of regression
> if there is a skew in the number of rows per pre-sorted group.
>
> Actually there were discussions about this during the work on commit
> 4a29eabd1. Please see [1]. I will repeat David's demonstration and
> rerun his query on the current master branch to see what happens.
>
> create table ab (a int not null, b int not null);
> insert into ab select 0,x from generate_Series(1,999000)x union all
> select x%1000+1,0 from generate_Series(999001,1000000)x;
>
> create index on ab (a);
>
> analyze ab;
>
> So this is a table with a very large skew: the 0 group has 999000
> rows, and the remaining groups 1-1000 have just 1 row each.
>
> -- on master
> explain (analyze, timing off) select * from ab order by a,b;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
> Incremental Sort (cost=2767.38..109344.55 rows=1000000 width=8)
> (actual rows=1000000 loops=1)
> Sort Key: a, b
> Presorted Key: a
> Full-sort Groups: 33 Sort Method: quicksort Average Memory: 26kB
> Peak Memory: 26kB
> Pre-sorted Groups: 1 Sort Method: external merge Average Disk:
> 17680kB Peak Disk: 17680kB
> -> Index Scan using ab_a_idx on ab (cost=0.42..22832.42
> rows=1000000 width=8) (actual rows=1000000 loops=1)
> Planning Time: 0.829 ms
> Execution Time: 1028.892 ms
> (8 rows)
>
> -- manually disable incremental sort
> set enable_incremental_sort to off;
> explain (analyze, timing off) select * from ab order by a,b;
> QUERY PLAN
> ------------------------------------------------------------------------------------------------
> Sort (cost=127757.34..130257.34 rows=1000000 width=8) (actual
> rows=1000000 loops=1)
> Sort Key: a, b
> Sort Method: external merge Disk: 17704kB
> -> Seq Scan on ab (cost=0.00..14425.00 rows=1000000 width=8)
> (actual rows=1000000 loops=1)
> Planning Time: 0.814 ms
> Execution Time: 765.198 ms
> (6 rows)
>
> Look, regression happens on current master!
>
> This is a question I’ve often pondered: each time we introduce a new
> optimization, there are always potential cases where it could lead to
> regressions. What should we do about this? I kind of agree with
> David's approach, as described in the commit message of 4a29eabd1:
>

Right, or as Goetz Graefe said "choice is confusion."

The funny thing is that it also matters *when* the alternative plans
are introduced. Had they been there from the very beginning, we'd
consider that behavior (including the occasional sub-optimal plan) the
baseline, and it'd be okay-ish. But when the alternative paths are
added later, the same behavior is more likely to be seen as a
"regression" ...

> "
> That, of course, as with teaching the planner any new tricks,
> means an increased likelihood that the planner will perform an incremental
> sort when it's not the best method. Our standard escape hatch for these
> cases is an enable_* GUC.
> "
>
> I know ideally we should not rely on these enable_* GUCs, but in
> reality it seems that sometimes we do not have a better solution.
>

Right, the basic truth is there's no way to teach the optimizer new
tricks without some risk of regression. We simply can't make perfect
decisions based on incomplete information (which is what the
statistics are, intentionally). It's up to us to reason about those
risks, and ideally to deal with them later, at execution time.
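
Just to illustrate what "incomplete information" means here - the
planner's view of a column is pretty much the summary in pg_stats.
For the "ab" table from David's demo above, something like this would
show it (not from the original mail, just a quick check):

  select n_distinct, most_common_vals, most_common_freqs
    from pg_stats
   where tablename = 'ab' and attname = 'a';

The MCV list does capture the skew (value 0 at ~99.9%), but IIRC the
incremental sort costing mostly works with the estimated number of
groups and assumes roughly equal group sizes, so a skew like this
doesn't really make it into the cost.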

I still don't think we should rely on GUCs too much for this.
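
That said, if someone does hit a regression like the one above, the
escape hatch at least doesn't have to be global - the usual SET LOCAL
pattern limits it to a single transaction. Just a sketch, using the
demo table from above:

  begin;
  set local enable_incremental_sort = off;
  select * from ab order by a, b;
  commit;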

regards

--
Tomas Vondra
