Re: Memory usage during sorting

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-03-22 00:57:00
Message-ID: CAMkU=1xkzNcgtXL8ObvfdNygGw+C+tJbUJ0jkLpCHL+kLu_grw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 19, 2012 at 12:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
...
>
> One thing that seems inefficient to me about our current algorithm is
> the use of the run number as a leading column in the sort key.
> There's no real reason why the tuples destined for the next run need
> to be maintained in heap order; we could just store them unordered and
> heapify the whole lot of them when it's time to start the next run.
> That ought to save comparison cycles for the current run, since the
> heap will be less deep, and there are other possible savings as well -
> in particular, an unordered array of tuples can be heapify'd in linear
> time, whereas our current algorithm is O(n lg n).  However, when I
> actually implemented this, I found that doing it this way was a loser.
>  In fact, excluding the next-run tuples from the heap for the current
> run was a BIG loser even before you add in the time required to
> re-heapify.  This confused the daylights out of me for a while,
> because it's hard to understand how insert and siftup can be slower on
> a small heap than a large one.
>
> My working theory is that, even though we must necessarily do more
> comparisons when manipulating a larger heap, many of those comparisons
> are resolved by comparing run numbers, and that's a lot cheaper than
> invoking the real comparator.  For example, if we insert into a heap
> whose rightmost child is in the next run, and the new tuple goes into
> the current run, and the current size of the heap is such that the
> initial position of the new element is a descendant of the right node,
> it will very quickly crash through all of the next-run tuples and only
> need one REAL comparison - against the root.  Either the new element
> ends up as the root, or it ends up as the direct child of the root;
> now we remove the root and, perhaps, replace it with its rightmost
> child, meaning that the next element we read in can do the exact same
> thing all over again.  If we exclude the next-run elements from the
> heap, then the average depth of the heap when inserting a new element
> is smaller, but all the elements are in the same run, and we have to
> invoke the real comparator every time.  In other words, those next-run
> tuples act as dummies which effectively create a heap of uneven depth,
> and the average depth encountered when inserting tuples turns out to
> be less than what we get by pulling out the dummies and making the
> depth uniform.
>
> This makes me kind of nervous, because I don't understand why things
> should work out so well as all that.  If throwing some dummy elements
> into the heap makes things work better, then maybe we should throw in
> some more.  Or maybe it would work better to take some but not all of
> them out.  There certainly doesn't seem to be any indication in the
> code that this is an anticipated effect, and it seems an awfully
> providential accident.

Is there a patch or a git repo available for this change? I'd like to
see if I can analyze it when I get some spare time.
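
For anyone else following along, my reading of the effect Robert
describes is that the run number acts as a cheap leading comparison:
any probe that is resolved against a next-run tuple never reaches the
datatype's comparator at all. A rough sketch of that shape (names
invented, not the actual tuplesort.c code):

typedef struct
{
    int         tupindex;   /* run this tuple is destined for */
    const void *tuple;      /* the tuple itself */
} SortHeapEntry;

/* the expensive, datatype-specific comparison */
extern int comparetup_real(const void *a, const void *b);

static int
heap_compare(const SortHeapEntry *a, const SortHeapEntry *b)
{
    /*
     * A current-run tuple beats any next-run tuple on this cheap integer
     * test alone, so crashing through the "dummy" next-run entries in
     * the heap never invokes the expensive comparator below.
     */
    if (a->tupindex != b->tupindex)
        return (a->tupindex < b->tupindex) ? -1 : 1;

    return comparetup_real(a->tuple, b->tuple);
}

Pull the next-run tuples out of the heap and every comparison falls
through to comparetup_real(), which would account for the slowdown even
before paying for the re-heapify.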

...

> It turns out that it's possible to reduce the number of comparisons
> required to reinsert the last heap element from 2*heap_depth to
> heap_depth+lg(heap_depth).  At each node, compare the two children,
> and then let the current node be the lesser of those.  When you reach
> a leaf node, you've walked a path of length heap_depth which is known
> to be sorted (since we've got a heap), so you can find the correct
> reinsertion position by binary searching that path.  In the case of
> the sorted data mentioned above, this reduces the time to 19.45
> seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB.
> However, in other cases, it seems not to help at all, or even be a
> slight regression.

Clever. Rather than doing a binary search of the path, it might make
sense to do a reverse search of it, starting from the leaf. The tuple
is likely to end up somewhere near the bottom of the heap, both because
that is where most of the data is and because the tuple being
reinserted just came from the bottom, so it is probably biased that
way.
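
To make that concrete, here is roughly the shape I have in mind, with a
plain int array standing in for the tuple heap (min-heap rooted at
heap[0], n > 0 assumed, names invented; a sketch, not a patch against
tuplesort.c):

#include <stddef.h>

static void
sift_down_bottom_up(int *heap, size_t n, int x)
{
    size_t  hole = 0;

    /*
     * Phase 1: walk the hole down to a leaf, always following the
     * smaller child.  One comparison per level (child vs. child), about
     * heap_depth in total, instead of the usual two per level.
     */
    for (;;)
    {
        size_t  child = 2 * hole + 1;

        if (child >= n)
            break;                      /* reached a leaf */
        if (child + 1 < n && heap[child + 1] < heap[child])
            child++;                    /* right child is smaller */
        heap[hole] = heap[child];
        hole = child;
    }

    /*
     * Phase 2: the values just shifted up lie along a sorted path, so
     * search it from the leaf end.  Because the new element usually
     * belongs near the bottom, this loop should exit after only a few
     * comparisons; Robert's binary search of the same path would bound
     * the cost at about lg(heap_depth) instead.
     */
    while (hole > 0)
    {
        size_t  parent = (hole - 1) / 2;

        if (heap[parent] <= x)
            break;
        heap[hole] = heap[parent];
        hole = parent;
    }
    heap[hole] = x;
}

Popping the minimum would then be along the lines of: min = heap[0];
x = heap[--n]; sift_down_bottom_up(heap, n, x). The bottom-up scan bets
that the element lands near the leaf end of the path; the binary search
gives a better worst case but gives up that early exit.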

Cheers,

Jeff
