GSOC 2018 Project - A New Sorting Routine

From: Kefan Yang <starordust(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: GSOC 2018 Project - A New Sorting Routine
Date: 2018-07-10 21:02:02
Message-ID: CAF448YLZz5cD+w5qYQGMcfFPY9zAu2mkkiCtOPSBJmwbjuPNig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, Hackers!

I am working on my project in Google Summer of Code 2018
<https://wiki.postgresql.org/wiki/GSoC_2018#Sorting_algorithms_benchmark_and_implementation_.282018.29>.
In this project, I am trying to improve the in-memory sorting routine in
PostgreSQL. Now I am very excited to share my progress with you guys.

Originally, PostgreSQL is using the QuickSort implemented by J. L. Bentley
and M. D. McIlroy in "Engineering a sort function" with some modifications.
This sorting routine is very fast, yet may fall to O(n^2) time complexity
in the worst case scenario. We are trying to find faster sorting algorithms
with guaranteed O(nlogn) time complexity.

In this patch, I

1. Use IntroSort to implement pg_qsort. IntroSort is a hybrid sorting
algorithm. It uses Quicksort most of the time, but switch to insertion sort
when the array is small and heapsort when the recursion exceeds depth limit.
2. Only check if the array is preordered once on the whole array to get
better overall performance. Previously the sorting routine checks if the
array is preordered on every recursion.

After some performance test, I find the new sorting routine

1. Slightly faster on sorting random arrays.
2. Much faster on worst case scenario since it has O(nlogn) worst case
complexity.
3. Has nearly the same performance on mostly sorted arrays.

I use both standalone tests and pgbench to show the result. A more detailed
report is in the attachment, along with the patch and some scripts to
reproduce the result.

*Notice:* this patch may fail a test case in ‘make check’ because QuickSort
and HeapSort are unstable. The two sorting routines may give different
results if multiple entries in the array have the same key value.

Generally speaking, I think the new sorting routine seems to be an
improvement in many ways. Please let me know if you have any thoughts or
suggestions.

Regards,

Kefan Yang

Attachment Content-Type Size
NewSorting.zip application/x-zip-compressed 1.4 MB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2018-07-10 21:25:28 Re: Optimze usage of immutable functions as relation
Previous Message Tomas Vondra 2018-07-10 20:56:48 Re: cost_sort() improvements