From: | Greg Stark <stark(at)mit(dot)edu> |
---|---|
To: | Peter Geoghegan <pg(at)heroku(dot)com> |
Cc: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com> |
Subject: | Re: Using quicksort for every external sort run |
Date: | 2015-12-11 22:41:24 |
Message-ID: | CAM-w4HPbiXm4G7F1O5ZPoi0pvGS1HzrFy1LSb+iy_gvKywE+eA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Dec 9, 2015 at 2:44 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>
> I guess you mean insertion sort. What's the theoretical justification
> for the change?
Well my thinking was that hard coding a series of comparisons would be
faster than a loop doing a O(n^2) algorithm even for small constants.
And sort networks are perfect for hard coded sorts because they do the
same comparisons regardless of the results of previous comparisons so
there are no branches. And even better the comparisons are as much as
possible independent of each other -- sort networks are typically
measured by the depth which assumes any comparisons between disjoint
pairs can be done in parallel. Even if it's implemented in serial the
processor is probably parallelizing some of the work.
So I implemented a quick benchmark outside Postgres based on sorting
actual SortTuples with datum1 defined to be random 64-bit integers (no
nulls). Indeed the sort networks perform faster on average despite
doing more comparisons. That makes me think the cpu is indeed doing
some of the work in parallel.
However the number of comparisons is significantly higher. And in the
non-"abbreviated keys" case where the compare is going to be a
function pointer call the number of comparisons is probably more
important than the actual time spent when benchmarking comparing
int64s. In that case insertion sort does seem to be better than using
the sort networks.
Interestingly it looks like we could raise the threshold to switching
to insertion sort. At least on my machine the insertion sort is faster
in real time as well as fewer comparisons up to 9 elements. It's
actually faster up to 16 elements despite doing more comparisons than
quicksort.
Note also how our quicksort does more comparisons than the libc
quicksort (which is actually merge sort in glibc I hear) which is
probably due to the "presorted" check.
$ for i in `seq 2 32` ; do echo ; echo $i ; ./a.out $i ; done
2
using bitonic sort 32.781ns per sort of 2 24-byte items 1.0
compares/sort 0.5 swaps/sort
using insertion sort 29.805ns per sort of 2 24-byte items 1.0
compares/sort 0.5 swaps/sort
using sort networks sort 26.392ns per sort of 2 24-byte items 1.0
compares/sort 0.5 swaps/sort
using libc quicksort sort 54.250ns per sort of 2 24-byte items 1.0 compares/sort
using qsort_ssup sort 46.666ns per sort of 2 24-byte items 1.0 compares/sort
3
using insertion sort 42.090ns per sort of 3 24-byte items 2.7
compares/sort 1.5 swaps/sort
using sort networks sort 38.442ns per sort of 3 24-byte items 3.0
compares/sort 1.5 swaps/sort
using libc quicksort sort 86.759ns per sort of 3 24-byte items 2.7 compares/sort
using qsort_ssup sort 41.238ns per sort of 3 24-byte items 2.7 compares/sort
4
using bitonic sort 73.420ns per sort of 4 24-byte items 6.0
compares/sort 3.0 swaps/sort
using insertion sort 61.087ns per sort of 4 24-byte items 4.9
compares/sort 3.0 swaps/sort
using sort networks sort 58.930ns per sort of 4 24-byte items 5.0
compares/sort 2.7 swaps/sort
using libc quicksort sort 135.930ns per sort of 4 24-byte items 4.7
compares/sort
using qsort_ssup sort 59.669ns per sort of 4 24-byte items 4.9 compares/sort
5
using insertion sort 88.345ns per sort of 5 24-byte items 7.7
compares/sort 5.0 swaps/sort
using sort networks sort 90.034ns per sort of 5 24-byte items 9.0
compares/sort 4.4 swaps/sort
using libc quicksort sort 180.367ns per sort of 5 24-byte items 7.2
compares/sort
using qsort_ssup sort 85.603ns per sort of 5 24-byte items 7.7 compares/sort
6
using insertion sort 119.697ns per sort of 6 24-byte items 11.0
compares/sort 7.5 swaps/sort
using sort networks sort 122.071ns per sort of 6 24-byte items 12.0
compares/sort 5.4 swaps/sort
using libc quicksort sort 234.436ns per sort of 6 24-byte items 9.8
compares/sort
using qsort_ssup sort 115.407ns per sort of 6 24-byte items 11.0 compares/sort
7
using insertion sort 152.639ns per sort of 7 24-byte items 14.9
compares/sort 10.5 swaps/sort
using sort networks sort 155.357ns per sort of 7 24-byte items 16.0
compares/sort 7.3 swaps/sort
using libc quicksort sort 303.738ns per sort of 7 24-byte items 12.7
compares/sort
using qsort_ssup sort 166.174ns per sort of 7 24-byte items 16.0 compares/sort
8
using bitonic sort 248.527ns per sort of 8 24-byte items 24.0
compares/sort 12.0 swaps/sort
using insertion sort 193.057ns per sort of 8 24-byte items 19.3
compares/sort 14.0 swaps/sort
using sort networks sort 230.738ns per sort of 8 24-byte items 24.0
compares/sort 12.0 swaps/sort
using libc quicksort sort 360.852ns per sort of 8 24-byte items 15.7
compares/sort
using qsort_ssup sort 211.729ns per sort of 8 24-byte items 20.6 compares/sort
9
using insertion sort 222.475ns per sort of 9 24-byte items 24.2
compares/sort 18.0 swaps/sort
using libc quicksort sort 427.760ns per sort of 9 24-byte items 19.2
compares/sort
using qsort_ssup sort 249.668ns per sort of 9 24-byte items 24.6 compares/sort
10
using insertion sort 277.386ns per sort of 10 24-byte items 29.6
compares/sort 22.5 swaps/sort
using libc quicksort sort 482.730ns per sort of 10 24-byte items 22.7
compares/sort
using qsort_ssup sort 294.956ns per sort of 10 24-byte items 29.0 compares/sort
11
using insertion sort 312.613ns per sort of 11 24-byte items 35.5
compares/sort 27.5 swaps/sort
using libc quicksort sort 583.617ns per sort of 11 24-byte items 26.3
compares/sort
using qsort_ssup sort 353.054ns per sort of 11 24-byte items 33.5 compares/sort
12
using insertion sort 381.011ns per sort of 12 24-byte items 41.9
compares/sort 33.0 swaps/sort
using libc quicksort sort 640.265ns per sort of 12 24-byte items 30.0
compares/sort
using qsort_ssup sort 396.703ns per sort of 12 24-byte items 38.2 compares/sort
13
using insertion sort 407.784ns per sort of 13 24-byte items 48.8
compares/sort 39.0 swaps/sort
using libc quicksort sort 716.017ns per sort of 13 24-byte items 33.8
compares/sort
using qsort_ssup sort 443.356ns per sort of 13 24-byte items 43.1 compares/sort
14
using insertion sort 461.696ns per sort of 14 24-byte items 56.3
compares/sort 45.5 swaps/sort
using libc quicksort sort 782.418ns per sort of 14 24-byte items 37.7
compares/sort
using qsort_ssup sort 492.749ns per sort of 14 24-byte items 48.1 compares/sort
15
using insertion sort 528.879ns per sort of 15 24-byte items 64.1
compares/sort 52.5 swaps/sort
using libc quicksort sort 868.679ns per sort of 15 24-byte items 41.7
compares/sort
using qsort_ssup sort 537.568ns per sort of 15 24-byte items 53.3 compares/sort
16
using bitonic sort 835.212ns per sort of 16 24-byte items 80.0
compares/sort 40.0 swaps/sort
using insertion sort 575.019ns per sort of 16 24-byte items 72.6
compares/sort 60.0 swaps/sort
using libc quicksort sort 944.284ns per sort of 16 24-byte items 45.7
compares/sort
using qsort_ssup sort 591.027ns per sort of 16 24-byte items 58.5 compares/sort
--
greg
From | Date | Subject | |
---|---|---|---|
Next Message | Greg Stark | 2015-12-11 22:52:59 | Re: Using quicksort for every external sort run |
Previous Message | Caleb Welton | 2015-12-11 22:41:08 | Re: Bootstrap DATA is a pita |