Re: Why do we still perform a check for pre-sorted input within qsort variants?

From: Dann Corbit <DCorbit(at)connx(dot)com>
To: 'Greg Stark' <stark(at)mit(dot)edu>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
Date: 2013-03-09 20:52:28
Message-ID: 87F42982BF2B434F831FCEF4C45FC33E5BD369DF@EXCHANGE.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

-----Original Message-----
From: gsstark(at)gmail(dot)com [mailto:gsstark(at)gmail(dot)com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 11:39 AM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <DCorbit(at)connx(dot)com> wrote:
> There is no such thing as a quicksort that never goes quadratic. It was formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

>>
No. It does make O(n*n) far less probable, but it does not eliminate it. If it were possible, then introspective sort would be totally without purpose.
And yet introspective sort is the algorithm of choice for the ANSI/ISO C++ standard directly because it will never go quadratic.
Bentley and Mcilroy's "Engineering a Sort Function" says this about their final qsort source code model chosen for their implementation in the footnote:
"Of course, quadratic behavior is still possible. One can generate fiendish inputs by bugging Quicksort: Consider
key values to be unknown initially. In the code for selecting a partition element, assign values in increasing
order as unknown keys are encountered. In the partitioning code, make unknown keys compare high."

Interestingly, some standard library qsort routines will go quadratic if simply given a set that contains random 0 and 1 values for the input set.

It has been formally proven that no matter what method used to choose the median (other than calculating the exact median of every partition which would make quicksort slower than heapsort) there exists an "adversary" distribution that makes quicksort go quadratic. (unfortunately, I can't find the proof or I would give you the citation). Median selection that is randomized does not select the true median unless it operates over all the elements of the entire partition. With some small probability, it makes the worst possible choice for the median. It also does not matter if you choose the first, last and middle elements for a median of three (or other evenly selected points for larger sample sets) or if you select the sample with randomization. Both methods have adversaries. Now, the quickselect algorithm can select the median in O(n) but that is again an average. There are also adversary distributions for quickselect.

On the other hand, various safeguards can make O(n*n) extremely improbable {even probabilities approaching zero}. There is a tradeoff with more and more complicated safeguards. If I choose large subsets of a partition and calculate the true median of the subset, it will make the routine safer from quadratic behavior but it will also make it slower. At some point, you end up with a routine no faster than heapsort, while also being much more complex and therefore more difficult to maintain. Hence, the necessity of introspective sort.

On the other hand, there are also data specific sorts that have very special properties. There are cache conscious versions of burstsort that can sort strings much faster than quicksort or even radix sort according to measurements. There is a special version of LSD radix sort that will sort an array in one counting pass and N distribution passes, where N is the radix. So, for instance, if you are sorting 4 byte integers and your radix is 256 {one byte} then you can sort in 5 passes. One pass to count the bytes in each integer by position, and 4 passes to distribute the data. This can also be done in place. MSD radix sort can also be used as an outer sort container which sorts by distribution until the partitions are small enough and then switches to introspective sort (which eventually switches to insertion sort). Radix sorts can also be made totally generic via callback routines like quicksort. Instead of a comparison operator, the callback gets request for the radix bits for a specific bin number and then returns the radix count bits for that specific bin. For unsigned char and radix 256, this is trivially character [i] from the string for bin [i], for example.

I guess improving sort routines all boils down to how much energy you want to put into it. Donald Knuth once stated that according to measurement, about 40% of business related mainframe computer cycles are involved in ordering data. So it is clearly an important problem.
<<

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dann Corbit 2013-03-09 21:10:33 Re: Why do we still perform a check for pre-sorted input within qsort variants?
Previous Message Fujii Masao 2013-03-09 19:50:15 Re: Support for REINDEX CONCURRENTLY