From: | Peter Geoghegan <peter(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Memory usage during sorting |
Date: | 2012-04-13 18:01:16 |
Message-ID: | CAEYLb_UbhYiZGm4z9gXyOEDGWKO_-3VCoMAdk4dERDEvXk0m8A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 13 April 2012 18:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> We do use insertion sort for partitions of size < 7. I assume you are
> referring to something else.
I stand corrected. My confusion came from the removal of a later
*switch* to insertion sort in
a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649, which also added the
pre-sorted check where you'd expect to see the insertion switch. Of
course, the n < 7 insertion sort thing is right beside the added
check. So another, redundant copy of insertion sort was removed, and
not the one that almost every real-world quicksort implementation has.
> Heap-sorting could also benefit from some optimization in this area.
> Right now, if you heap-sort data that's already in order, it gets
> progressively slower as you crank up work_mem, because the heap gets
> deeper and so extraction and siftup get more expensive, for no real
> benefit. We could do something like check, every time we use up our
> available memory, whether the heap is already entirely in order. If
> so, we could dump all but one tuple to the current run and the start
> reading tuples again. Or maybe dump some percentage of the array,
> though that might require a bit more bookkeeping. Or maybe there's a
> smarter algorithm that would also cater to mostly-sorted data.
Well, timsort is specifically designed to take advantage of pre-sorted
data. It does appear to have a lot of traction, as wikipedia points
out:
"Timsort has been Python's standard sorting algorithm since version
2.3. It is now also used to sort arrays in Java SE 7,[2] and on the
Android platform.[3] "
Somebody has produced a BSD-licensed C implementation that draws
heavily on the Python/Java one here:
http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17
It looks like it has exactly the same interface as a c stdlib qsort.
So it'd be fairly easy to produce a timsort_arg() based on this, if
anyone cares to.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
From | Date | Subject | |
---|---|---|---|
Next Message | Kevin Grittner | 2012-04-13 18:15:10 | Re: [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'. |
Previous Message | Robert Haas | 2012-04-13 17:33:43 | Re: Memory usage during sorting |