Re: No merge sort?

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: No merge sort?
Date: 2003-04-07 22:26:24
Message-ID: D90A5A6C612A39408103E6ECDD77B829408AC0@voyager.corporate.connx.com
Lists: pgsql-hackers

> -----Original Message-----
> From: Greg Stark [mailto:gsstark(at)mit(dot)edu]
> Sent: Monday, April 07, 2003 3:06 PM
> To: Dann Corbit
> Cc: Greg Stark; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] No merge sort?
>
> "Dann Corbit" <DCorbit(at)connx(dot)com> writes:
>
> > By the use of a randomized sample of min(n,max(3,log(n))) elements,
> > the probability of choosing the worst case median is so
> close to zero
> > as to never happen in practice.
>
> Not exactly. In fact the probability is no different than
> otherwise, but it won't be repeatable and it won't be on
> common data sets like pre-sorted data.

This is wrong. If I choose a single median based on a fixed pattern (e.g.
the first, last, or middle element), then the probability is very high that I
have made a bad choice. If I choose the median from a sample of 3, it
is less likely that I have made a bad choice. If I choose the median
from a sample of 32 elements chosen at random, it becomes extremely
unlikely that I have made a stupid choice. If I choose the median based
on examination of every element, the chance that I have chosen the wrong
pivot is zero. This technique [choosing the pivot as the median of a
random sample] clearly makes failure due to perverse cases less
likely. A good quicksort implementation will also scan the partition
for sequential or reversed data ordering.

By using a sample of many medians, the chances of choosing the wrong
median go down. By randomizing the selection, the chances of being
fooled by a pattern go down. Most real data has some pattern to it.
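
To make the idea concrete, here is a minimal sketch (my own
illustration, not code taken from any particular implementation) of
choosing a quicksort pivot as the median of a random sample sized
roughly min(n, max(3, log2(n))). It sorts integers only, assumes
n >= 1, and assumes the caller has seeded rand():

    #include <stdlib.h>

    #define MAX_SAMPLE 64

    /* Return the index of a pivot chosen as the median of a random
     * sample of the partition a[0..n-1]. */
    static size_t
    choose_pivot(const int *a, size_t n)
    {
        size_t idx[MAX_SAMPLE];
        size_t k = 3, i, j;

        /* grow k toward log2(n); the rule is min(n, max(3, log2(n))) */
        while (k < MAX_SAMPLE && ((size_t) 1 << k) < n)
            k++;
        if (k > n)
            k = n;

        /* draw k random positions from the partition */
        for (i = 0; i < k; i++)
            idx[i] = (size_t) rand() % n;

        /* insertion-sort the sampled positions by the values they
         * point at */
        for (i = 1; i < k; i++)
        {
            size_t t = idx[i];

            for (j = i; j > 0 && a[idx[j - 1]] > a[t]; j--)
                idx[j] = idx[j - 1];
            idx[j] = t;
        }

        return idx[k / 2];      /* position of the sample median */
    }

With k near log2(n), a patterned input (sorted, reversed, organ-pipe)
has no repeatable way to force the sample median to be a bad pivot.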

> > When sorting data, if all the information fits into memory any good
> > O(n*log(n)) technique is probably good enough. For one million
> > records,
> > n*log2(n) is just 20 million. 20 million comparison and exchange
> > operations are an eye-blink on fast, modern hardware when
> performed in
> > memory.
>
> Woah, this is extremely dangerous thinking. For all you know
> the application needs to execute this sort 200 times per second...

We are talking about SQL database applications. If they need to sort
one million records 20 times per second, they have a serious design
flaw, either in the database or the application.

> Applications that can execute their queries entirely in
> memory are frequently OLTP applications that need response
> within a hard upper limit. Often that upper limit is measured
> in milliseconds...

The choice of internal sort routine (among any of the good ones) will only
change the speed by a small, constant factor for a given distribution.
For any choice of algorithm, there will be some data sets where one of
the alternatives would have been better.

These algorithms are all about the same:
Quicksort (as modified by Richard Singleton -- the original is useless
for serious work)
Heapsort
Mergesort

The heapsort algorithm has the highest constant cost of the three, but
it never goes perverse and requires no additional memory.
The mergesort algorithm requires additional memory, and that allocation
can fail, so it is not certain that the sort can be performed.
The standard quicksort algorithm has three perverse cases (sorted,
reversed, and organ-pipe input), but all of these can be detected
and prevented.
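
As an illustration of the "detected and prevented" point (again only a
sketch of one way it might be done, not the code of any particular
library), a pre-partition scan can recognize an already-sorted or
exactly-reversed run and finish it without recursing:

    #include <stddef.h>

    static void
    swap_int(int *x, int *y)
    {
        int t = *x; *x = *y; *y = t;
    }

    /* Returns 1 if a[0..n-1] was already sorted, or was reversed and
     * has now been reversed in place, so the caller can skip
     * partitioning this run entirely.  Returns 0 otherwise. */
    static int
    handle_trivial_order(int *a, size_t n)
    {
        size_t i, asc = 0, desc = 0;

        for (i = 1; i < n; i++)
        {
            if (a[i - 1] <= a[i])
                asc++;
            if (a[i - 1] >= a[i])
                desc++;
        }
        if (asc == n - 1)
            return 1;           /* already in non-decreasing order */
        if (desc == n - 1)
        {
            /* non-increasing: reverse in place and we are done */
            for (i = 0; i < n / 2; i++)
                swap_int(&a[i], &a[n - 1 - i]);
            return 1;
        }
        return 0;               /* fall through to normal partitioning */
    }

A scan like this costs one extra pass over the partition, but it removes
the classic quadratic behavior on pre-sorted and reversed input; the
organ-pipe case is left to the randomized pivot selection above.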

> > The thing we should worry about is what happens when disk I/O rears
> > its ugly head. You have a better solution to that
> situation, and you
> > have a better sorting routine for a database.
>
> Not "better" just different. When dealing with disk I/O the
> database has to use an algorithm that doesn't require too
> much random access reads. Disk is quite fast at reading and
> writing sequentially but extremely slow at random access.
>
>
> While I/O bandwidth is often the bottleneck on databases it's
> also a lot easier to deal with. You can make huge stripesets
> and use super fast i/o controllers, to the point of
> saturating the machine's internal bus. It's much harder to
> increase the amount of cpu cycles available.
>
> Actually, in my experience cpu has been a bottleneck as often
> as disk i/o. But my experience is quite heavily skewed to
> OLTP applications.
>
> --
> greg
>
>
