Re: Why do we still perform a check for pre-sorted input within qsort variants?

From: Dann Corbit <DCorbit(at)connx(dot)com>
To: 'Greg Stark' <stark(at)mit(dot)edu>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
Date: 2013-03-09 10:32:57
Message-ID: 87F42982BF2B434F831FCEF4C45FC33E5BD367BB@EXCHANGE.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

----Original Message-----
>From: gsstark(at)gmail(dot)com [mailto:gsstark(at)gmail(dot)com] On Behalf Of Greg Stark
>Sent: Friday, March 08, 2013 4:59 PM
>To: Dann Corbit
>Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
>Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
>
>On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <DCorbit(at)connx(dot)com> wrote:
>> Checking for pre-sorted input will not make the routine faster on average.
>> However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take K*10^12th operations when the bad condition does occur.
>> Checking for pre-sorted data can be thought of as an insurance policy. It's kind of a pain to pay the premiums, but you sure are glad you have it when an accident occurs.
>> Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.
>
>Well pre-sorted inputs are not the only quadratic case. If we really wanted to eliminate quadratic cases we could implement the pivot choice algorithm that guarantees n*log(n) worst-case.

There are three things that give qsort fits:
1. Large ascending partitions.
2. Large descending partitions.
3. Bad choice of the median.
If you examine the link in the previous article I gave:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
It has a subfolder located by following this link:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
In that folder is a file called qsortpdq.c
In file qsortpdq.c is a function:
void qsortPDQ(void *a, size_t n, size_t es, int (*cmp) (const void *, const void *));
which is an interface that checks for ascending partitions, descending partitions, and does a median of 3 choice for the pivot.
Now, even this routine will have very rare inputs that can cause it to go quadratic. The probability of one of these particular inputs is very low.

If you want 100% guarantee against quadratic behavior, then qsortpdq can be wrapped in heapsort with a recursion depth check. That sort would never go quadratic. That method is called introspective sort.
Here are some links that describe introspective sort:
http://en.wikipedia.org/wiki/Introsort
https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/heapsort/pages/introspect.html
And this is the real thing, right from the horse's mouth:
http://www.cs.rpi.edu/~musser/gp/introsort.ps

There is no such thing as a quicksort that never goes quadratic. It was formally proven. I cannot find the proof that I once read, but the article "A Killer Adversary for Quicksort" by M. D. MCILROY answers pretty nearly.

Here is the summary from that paper:

SUMMARY
Quicksort can be made to go quadratic by constructing input on the fly in response to the sequence of
items compared. The technique is illustrated by a specific adversary for the standard C qsort function.
The generalmethod works against any implementation of quicksort-even a randomizing one-that satisfies
certain very mild and realistic assumptions.

>But that will increase the average run-time. If we're not doing that then I think your whole argument falls apart. We do care about the average case as well as the worst-case.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
will build the sort system and test it.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
will build the simpler sort system and test it.

I would suggest, in addition, the use of this routine to create a tough data set for robustness:
http://www.cs.dartmouth.edu/~doug/aqsort.c

>There's been a *ton* of research on sorting. I find it hard to believe there isn't a pretty solid consensus on which of these defense mechanisms is the best trade-off.
I'm partial to chapter 13 of "C Unleashed" in that regard. As far as the trade-offs to, the problem is that they really are trade-offs. However, the cost is quite small (benchmark and see) and the risk of not performing the checks is enormous. Mostly sorted data happens all the time in real life, as do other perverse distributions that cause problems for sorting algorithms.to believe there isn't a pretty solid consensus on which of these defense mechanisms is the best trade-off.

My own personal recommendation would be to include every single safeguard (all three data checks and a recursion depth check as well).
That's what we do here (I work at a database tools company).

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit kapila 2013-03-09 12:24:03 Re: Identity projection
Previous Message Ian Pilcher 2013-03-09 08:52:43 Re: Trust intermediate CA for client certificates