Re: Memory usage during sorting

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-03-19 19:23:15
Message-ID: CA+Tgmob-KrZoH88d4W0C-FEcjaF=WFKxPg=TfZ1HZZpvHddokg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> So has somebody found a hole in the n log n lower bound on the cost of
> comparison-based sorting?  I thought that had been proven pretty
> rigorously.

There's not much danger of anyone finding a way around that bound
since the proof is quite trivial, but keep in mind that it's a
worst-case bound. For example, if you have reason to believe that the
input is likely to already be sorted, you can check for that case
using just n-1 comparisons. If it turns out you're right, you can
"sort" the data in linear time. Heck, you can also check for
reverse-sorted input at the same time, if you like, and handle that
special case too. Of course, when the data is unordered, such
special-case checks hurt rather than help, even though it's still O(n
lg n) overall; those pesky constant factors do matter quite a bit.
Still, sometimes it's worth it: our quicksort algorithm checks for
sorted input on each recursive invocation, since quicksort degrades
rather badly in that scenario; but our heapsort doesn't contain a
similar check, probably because heapsort typically works fine in that
scenario.
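
To spell that out, the pre-check is trivial - here's a generic sketch
(invented names, not the actual tuplesort code) that spends exactly
n-1 comparisons deciding whether the input is already in order:

#include <stdbool.h>
#include <stddef.h>

typedef int (*comparator) (const void *a, const void *b, void *arg);

/* Return true if elems[0..n-1] is already in order: exactly n-1 comparisons. */
static bool
array_is_presorted(void **elems, size_t n, comparator cmp, void *arg)
{
    size_t      i;

    for (i = 1; i < n; i++)
    {
        if (cmp(elems[i - 1], elems[i], arg) > 0)
            return false;       /* out of order; fall back to a real sort */
    }
    return true;                /* "sorting" is now a no-op */
}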

One thing that seems inefficient to me about our current algorithm is
the use of the run number as a leading column in the sort key.
There's no real reason why the tuples destined for the next run need
to be maintained in heap order; we could just store them unordered and
heapify the whole lot of them when it's time to start the next run.
That ought to save comparison cycles for the current run, since the
heap will be less deep, and there are other possible savings as well -
in particular, an unordered array of tuples can be heapify'd in linear
time, whereas our current algorithm is O(n lg n). However, when I
actually implemented this, I found that doing it this way was a loser.
In fact, excluding the next-run tuples from the heap for the current
run was a BIG loser even before you add in the time required to
re-heapify. This confused the daylights out of me for a while,
because it's hard to understand how insert and siftup can be slower on
a small heap than a large one.
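
For concreteness, the linear-time construction I'm referring to is just
the textbook bottom-up heapify - shown here as a generic sketch over an
int array, not the real SortTuple machinery:

/* Standard top-down sift-down on a min-heap of n ints rooted at index i. */
static void
heap_siftdown(int *heap, size_t n, size_t i)
{
    for (;;)
    {
        size_t      left = 2 * i + 1;
        size_t      right = left + 1;
        size_t      smallest = i;
        int         tmp;

        if (left < n && heap[left] < heap[smallest])
            smallest = left;
        if (right < n && heap[right] < heap[smallest])
            smallest = right;
        if (smallest == i)
            break;
        tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/* Floyd's bottom-up construction: total sift-down work is O(n). */
static void
heapify_linear(int *heap, size_t n)
{
    size_t      i;

    for (i = n / 2; i-- > 0;)
        heap_siftdown(heap, n, i);
}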

My working theory is that, even though we must necessarily do more
comparisons when manipulating a larger heap, many of those comparisons
are resolved by comparing run numbers, and that's a lot cheaper than
invoking the real comparator. For example, if we insert into a heap
whose rightmost child is in the next run, and the new tuple goes into
the current run, and the current size of the heap is such that the
initial position of the new element is a descendant of the right node,
it will very quickly crash through all of the next-run tuples and only
need one REAL comparison - against the root. Either the new element
ends up as the root, or it ends up as the direct child of the root;
now we remove the root and, perhaps, replace it with its rightmost
child, meaning that the next element we read in can do the exact same
thing all over again. If we exclude the next-run elements from the
heap, then the average depth of the heap when inserting a new element
is smaller, but all the elements are in the same run, and we have to
invoke the real comparator every time. In other words, those next-run
tuples act as dummies which effectively create a heap of uneven depth,
and the average depth encountered when inserting tuples turns out to
be less than what we get by pulling out the dummies and making the
depth uniform.
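
In code form, the effect amounts to a comparator shaped roughly like
this (a sketch with invented names, not the actual code in
tuplesort.c): tuples in different runs are ordered by a cheap integer
test, and only same-run pairs pay for a real comparison.

typedef struct SortTupleSketch
{
    int         tupindex;       /* run number this tuple is destined for */
    void       *tuple;          /* the actual tuple */
} SortTupleSketch;

static int
heap_compare_sketch(const SortTupleSketch *a, const SortTupleSketch *b,
                    int (*real_cmp) (const void *, const void *))
{
    /* cheap test: tuples in different runs never hit the real comparator */
    if (a->tupindex != b->tupindex)
        return (a->tupindex < b->tupindex) ? -1 : 1;

    /* same run: now we have to pay for a real comparison */
    return real_cmp(a->tuple, b->tuple);
}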

This makes me kind of nervous, because I don't understand why things
should work out so well as all that. If throwing some dummy elements
into the heap makes things work better, then maybe we should throw in
some more. Or maybe it would work better to take some but not all of
them out. There certainly doesn't seem to be any indication in the
code that this is an anticipated effect, and it seems an awfully
providential accident.

It may (or may not) be related to the effect that Greg Stark mentions
in a nearby post: increasing work_mem makes the heap bigger, so heap
maintenance gets more expensive, sometimes without any corresponding
benefit. For example, if the input happens to already be sorted, then
inserting into the heap is cheap, because each new element is already
greater than its parent, and life is good. But removing from the heap
is painfully expensive, because to remove an element from the heap we
stick the last element in the heap into the hole left by the
departing element and then sift it down. However, that requires
2*heap_depth comparisons, all of which involve really invoking the
comparison function since everything's going to end up going into a
single run, and that gets more and more expensive as work_mem
increases. So a heap sort of already-sorted data seems to get slower
and slower the more memory you give it to work with. For example, a
sort of 5 million random strings of length 30, all in shared_buffers,
takes 22.03 seconds on my MacBook Pro with work_mem = 128MB, but just
17.08 seconds with work_mem = 1MB.
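
To be explicit about where those 2*heap_depth comparisons come from,
here's the conventional removal on a toy int min-heap (a sketch, not
the real tuplesort_heap_siftup): one comparison per level to pick the
smaller child, plus one against the element being sifted down.

static void
heap_remove_root(int *heap, size_t *np)
{
    size_t      n = --*np;          /* shrink the heap */
    int         moved = heap[n];    /* last element fills the hole at the root */
    size_t      i = 0;

    while (2 * i + 1 < n)
    {
        size_t      child = 2 * i + 1;

        /* comparison 1: which child is smaller? */
        if (child + 1 < n && heap[child + 1] < heap[child])
            child++;
        /* comparison 2: are we done sifting?  (this is the early exit) */
        if (moved <= heap[child])
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = moved;
}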

It turns out that it's possible to reduce the number of comparisons
required to reinsert the last heap element from 2*heap_depth to
heap_depth+lg(heap_depth). At each node, compare the two children,
and then let the current node be the lesser of those. When you reach
a leaf node, you've walked a path of length heap_depth which is known
to be sorted (since we've got a heap), so you can find the correct
reinsertion position by binary searching that path. In the case of
the sorted data mentioned above, this reduces the time to 19.45
seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB.
However, in other cases, it seems not to help at all, or even to be a
slight regression. An obvious weakness of this algorithm is that the
linear search we use now can terminate early, if it turns out that the
element we're reinserting doesn't need to percolate downward very far,
whereas this can't: so the 2*heap_depth bound is worst-case, but the
heap_depth+lg(heap_depth) bound is exact. The problem of some
comparisons being much cheaper than others also acts to muddy the
waters - if there's just one element in the heap destined for the next
run, reinserting it will be only about half as expensive, since we'll
compare the descendants of each node to each other, but the comparison
against the next-run element will be very cheap, and if all the tuples
are the same size we may end up reinserting that particular node very
frequently.
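
For reference, on the same toy int heap as above, the
walk-to-a-leaf-and-binary-search idea comes out something like this
(again just a sketch of the technique, not the actual patch):

static void
heap_remove_root_leafsearch(int *heap, size_t *np)
{
    size_t      n = --*np;
    int         moved = heap[n];
    size_t      path[64];           /* node indexes root-to-leaf; 64 levels is plenty */
    size_t      depth = 0;
    size_t      i = 0;
    size_t      lo,
                hi,
                k,
                j;

    if (n == 0)
        return;

    /* Phase 1: one comparison per level, always following the smaller child */
    path[depth++] = 0;
    while (2 * i + 1 < n)
    {
        size_t      child = 2 * i + 1;

        if (child + 1 < n && heap[child + 1] < heap[child])
            child++;
        path[depth++] = child;
        i = child;
    }

    /*
     * Phase 2: the values along path[1..depth-1] are sorted (heap property),
     * so binary-search for the deepest node whose value is still <= moved.
     */
    lo = 1;
    hi = depth;
    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (heap[path[mid]] <= moved)
            lo = mid + 1;
        else
            hi = mid;
    }
    k = lo - 1;                     /* k == 0 means moved belongs at the root */

    /* Phase 3: promote everything above that point one level, drop moved in */
    for (j = 1; j <= k; j++)
        heap[path[j - 1]] = heap[path[j]];
    heap[path[k]] = moved;
}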

Heikki wrote a patch implementing another "smart" algorithm for the
reinsertion operation (what we call tuplesort_heap_siftup), which I
won't attempt to explain but which seems to work even a little bit
better than the one mentioned above. However, it seems to fall prey to
some of the same issues
- there are cases where it helps, but there are also cases where it
doesn't do much or even regresses slightly, and I don't think either
of us understand exactly how to tickle the various cases, let alone
what causes them. What does seem clear is that the great bulk of the
cost of a heap sort seems to be attributable to tuplesort_heap_siftup,
and if we can find something that reduces the number of comparisons
required for that operation we will get a more-or-less proportional
speedup in heap-sorting overall.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
