From: | "Dann Corbit" <DCorbit(at)connx(dot)com> |
---|---|
To: | "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: No merge sort? |
Date: | 2003-04-08 19:37:04 |
Message-ID: | D90A5A6C612A39408103E6ECDD77B829408AC4@voyager.corporate.connx.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
> -----Original Message-----
> From: Ron Peacetree [mailto:rjpeace(at)earthlink(dot)net]
> Sent: Monday, April 07, 2003 6:33 PM
> To: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] No merge sort?
>
>
> <cbbrowne(at)cbbrowne(dot)com> wrote in message
> news:20030407202001(dot)1EC3C58E0D(at)cbbrowne(dot)com(dot)(dot)(dot)
> > Ron Peacetree wrote:
> > > AFAIK, there are only 3 general purpose internal sorting
> techniques
> > > that have O(n) behavior:
> >
> > Hum? NO "general purpose" sorting technique has O(n) behaviour.
> >
> > The theoretical best scenario, _in general_, is O(n log n).
> The O(log(n!)) bound is only for comparison based sorts
> operating on arbitarily disordered input. There are general
> techniques that can sort in O(n) time if we can "break"
> either assumption. In a DB, we often are dealing with data
> that is only slightly disordered, or we are have enough meta
> knowledge that we can use more powerful ordering operators
> than simple comparisons, or both.
>
>
> > Insertion sort is expected to provide O(n^2) behaviour, and
> radix-like
> > schemes get arbitrarily memory hungry and have bad pathological
> results.
> >
> None of these comments is accurate.
> The sources for the following discussion are
> A= Vol 3 of Knuth 2nd ed, (ISBN 0-201-89685-0)
> B= Sedgewick's Algorithms in C, (0-201-51425-7)
> C= Handbook of Algorithms and Data Structures 2nd ed by
> Gonnet and Baeza-Yates. (0-201-41607-7)
>
> 1= Insertion sort is O(n) for "almost sorted" input.
> p103 of Sedgewick. There's also discussion on this in Knuth.
>
> 2= Distribution Sort and it's "cousins" which use more
> powerful ordering operators than simply comparisons are a=
> reasonably general, and b= O(n). Look in all three references.
>
> 3= A proper implementation of straight Radix sort using 8b or
> 16b at a time a= is NOT pathological, and b= is O(n).
> Sedgewick is the most blunt about it on p142-143, but Knuth
> discusses this as well.
There can be a problem here with 2 & 3. Imagine the following query:
SELECT company, division, plant, sum(revenue) FROM corporate GROUP BY
company, division, plant ORDER BY company, division, plant
Let's suppose further that company, division, and plant are all char 20.
For a given section of company + division, the first 40 characters will
be the same. So a radix or distribution sort will need 40 passes over
the identical data before it even gets to the differentiating field
plant.
Now, a comparison sort will be O(n*log(n)). The counting passes alone
for MSD radix sort will be 40 passes * N. How large will n have to be
before Radix sort outperforms a comparison based sort?
2^40 = 1,099,511,627,776
About one trillion items appears to be break even.
If they are fixed width, we could use LSD radix sort and count all the
bins in one pass. But that technique will fail with varchar.
Naturally, most long character fields will be varchar. The basic upshot
is that radix and distribution sorts work best with short keys.
> All of the above are stable, which quicksort is not. There
> are no "pessimal" inputs for any of the above that will force
> worst case behavior. For quicksort there are (they are
> =very= unlikely however). In real world terms, if you can use
> any of these approaches you should be able to internally sort
> your data between 2X and 3X faster.
>
>
> Unfortunately, most of us do not have the luxury of working
> with Memory Resident DB's. In the real world, disk access is
> an important part of our DB sorting efforts, and that changes
> things. Very fast internal sorting routines combined with
> multidisk merging algorithms that minimize overall disk I/O
> while maximizing the sequential nature of that I/O are the
> best we can do overall in such a situation.
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
From | Date | Subject | |
---|---|---|---|
Next Message | Neil Conway | 2003-04-08 19:54:50 | Re: Anyone know why PostgreSQL doesn't support 2 phase |
Previous Message | cbbrowne | 2003-04-08 18:58:42 | Re: Complex database for testing, U.S. Census Tiger/UA |