Re: [HACKERS] [PATCH] Incremental sort

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [PATCH] Incremental sort
Date: 2018-03-10 17:05:25
Message-ID: CAPpHfdtHggE3gK1xytjk4QV1skXoeO2DHtcw37tBA5H+ew7N5Q@mail.gmail.com

On Sat, Mar 10, 2018 at 6:42 PM, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> wrote:

> On Thu, Mar 8, 2018 at 2:49 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> Thank you very much for testing and benchmarking. I'll investigate
> the regressions you found.
>
>
>> Now, there's a caveat in those tests - the data set is synthetic and
>> perfectly random, i.e. all groups equally likely, no correlations or
>> anything like that.
>>
>> I wonder what is the "worst case" scenario, i.e. how to construct a data
>> set with particularly bad behavior of the incremental sort.
>
>
> I think that depends on the reason for the bad behavior of incremental
> sort. For example, our quicksort implementation behaves very well on
> presorted data. But incremental sort appears not to be so good in
> this case, as Heikki showed upthread. That prompted me to test
> presorted datasets (which appeared to be the "worst case") more
> intensively. But I suspect that the regressions you found have another
> reason, and correspondingly the "worst case" would also be different.
> Once I investigate the reason for the regressions, I'll try to
> construct a "worst case" as well.
>
>

After some investigation of the benchmark results, I found two sources of
regressions for incremental sort.

*Case 1: The loss from the underlying node scan is bigger than the incremental sort win*

===== 33 [Wed Mar 7 10:14:14 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:on max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1588080.84..1588080.84 rows=1 width=20) (actual time=5874.527..5874.527 rows=0 loops=1)
   ->  Incremental Sort  (cost=119371.51..1488081.45 rows=9999939 width=20) (actual time=202.842..5653.224 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Presorted Key: s_1.a
         Sort Method: external merge  Disk: 29408kB
         Sort Groups: 11
         ->  Index Scan using s_1_a_idx on s_1  (cost=0.43..323385.52 rows=9999939 width=20) (actual time=0.051..1494.105 rows=10000000 loops=1)
 Planning time: 0.269 ms
 Execution time: 5877.367 ms
(9 rows)

===== 37 [Wed Mar 7 10:15:51 CET 2018] scale:10000000 groups:10 work_mem:64MB incremental:off max_workers:0 =====
SELECT * FROM s_1 ORDER BY a, b
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1656439.93..1656439.93 rows=1 width=20) (actual time=4741.716..4741.716 rows=0 loops=1)
   ->  Sort  (cost=1531440.69..1556440.54 rows=9999939 width=20) (actual time=3522.156..4519.278 rows=10000000 loops=1)
         Sort Key: s_1.a, s_1.b
         Sort Method: external merge  Disk: 293648kB
         ->  Seq Scan on s_1  (cost=0.00..163694.39 rows=9999939 width=20) (actual time=0.021..650.322 rows=10000000 loops=1)
 Planning time: 0.249 ms
 Execution time: 4777.088 ms
(7 rows)

In this case the optimizer decided that "Index Scan + Incremental Sort"
would be cheaper than "Seq Scan + Sort". But it appears that the amount of
time we lose by selecting Index Scan over Seq Scan is bigger than the amount
of time we win by selecting Incremental Sort over Sort. I would note that
the regular Sort consumes about 10X more disk space. I bet all of that
space fit into the OS cache of the test machine, while the optimizer
expected actual IO to take place. That made the actual time diverge from
the costing.
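To make the mismatch concrete, here is a small sketch comparing the planner's estimated costs with the measured execution times, using only the figures from the two plans above:

```python
# Estimated total costs and measured times, taken from the two plans above.
est_incr = 1588080.84   # Limit over Index Scan + Incremental Sort
est_sort = 1656439.93   # Limit over Seq Scan + Sort
ms_incr = 5877.367      # measured execution time of the incremental plan (ms)
ms_sort = 4777.088      # measured execution time of the regular plan (ms)

# The planner picks the plan with the lower estimated cost...
assert est_incr < est_sort
# ...but the measured times go the other way: the index scan's extra cost
# outweighs the incremental sort's saving once the Sort's disk I/O is
# absorbed by the OS cache.
assert ms_incr > ms_sort
print(f"estimate gap: {est_sort - est_incr:.0f}, time gap: {ms_incr - ms_sort:.0f} ms")
```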

*Case 2: The underlying node is not parallelized*

===== 178 [Wed Mar 7 11:18:53 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:on max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1179047.88..1179047.88 rows=1 width=20) (actual time=4819.999..4819.999 rows=0 loops=1)
   ->  Incremental Sort  (cost=89.04..1079047.34 rows=10000054 width=20) (actual time=0.203..4603.197 rows=10000000 loops=1)
         Sort Key: s_2.a, s_2.b, s_2.c
         Presorted Key: s_2.a, s_2.b
         Sort Method: quicksort  Memory: 135kB
         Sort Groups: 10201
         ->  Index Scan using s_2_a_b_idx on s_2  (cost=0.43..406985.62 rows=10000054 width=20) (actual time=0.052..1461.177 rows=10000000 loops=1)
 Planning time: 0.313 ms
 Execution time: 4820.037 ms
(9 rows)

===== 182 [Wed Mar 7 11:20:11 CET 2018] scale:10000000 groups:100 work_mem:8MB incremental:off max_workers:2 =====
SELECT * FROM s_2 ORDER BY a, b, c
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1705580.76..1705580.76 rows=1 width=20) (actual time=3985.818..3985.818 rows=0 loops=1)
   ->  Gather Merge  (cost=649951.66..1622246.98 rows=8333378 width=20) (actual time=1782.354..3750.868 rows=10000000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=648951.64..659368.36 rows=4166689 width=20) (actual time=1778.362..2091.253 rows=3333333 loops=3)
               Sort Key: s_2.a, s_2.b, s_2.c
               Sort Method: external merge  Disk: 99136kB
               Worker 0:  Sort Method: external merge  Disk: 96984kB
               Worker 1:  Sort Method: external merge  Disk: 97496kB
               ->  Parallel Seq Scan on s_2  (cost=0.00..105361.89 rows=4166689 width=20) (actual time=0.022..233.640 rows=3333333 loops=3)
 Planning time: 0.265 ms
 Execution time: 4007.591 ms
(12 rows)

The situation is similar to case #1, except that in the "Seq Scan + Sort"
pair the Sort also gets parallelized. As in the previous case, disk
writes/reads during the external sort are overestimated, because they
actually hit the OS cache. I would also say that this is not necessarily
a wrong decision by the optimizer, because doing this work in a single
backend may consume fewer resources despite being slower overall.
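A rough illustration of that resource trade-off, using the wall-clock figures from the two plans above (the 3x multiplier for the parallel plan, leader plus two workers, is an upper-bound assumption, not a measurement):

```python
# Wall-clock figures from the two plans above.
serial_ms = 4820.037     # Incremental Sort plan, single backend
parallel_ms = 4007.591   # Gather Merge plan, leader + 2 workers

# The parallel plan wins on wall-clock time...
assert parallel_ms < serial_ms
# ...but it keeps three processes busy for most of that time, so its
# aggregate CPU consumption can be up to roughly triple the wall-clock
# figure, well above what the single backend uses.
parallel_cpu_upper_ms = 3 * parallel_ms
assert parallel_cpu_upper_ms > serial_ms
print(f"serial: {serial_ms:.0f} ms wall; "
      f"parallel: {parallel_ms:.0f} ms wall, "
      f"up to ~{parallel_cpu_upper_ms:.0f} ms aggregate CPU")
```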

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
