Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeremy Harris <jgh(at)wizmail(dot)org>
Subject: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date: 2015-08-04 08:24:20
Message-ID: 55C076B4.6020009@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/03/2015 11:36 PM, Robert Haas wrote:
> On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>> When it's time to drain the heap, in performsort, divide the array into two
>>> arrays, based on the run number of each tuple, and then quicksort the arrays
>>> separately. The first array becomes the in-memory tail of the current tape,
>>> and the second array becomes the in-memory tail of the next tape.
>>>
>>> You wouldn't want to actually allocate two arrays and copy SortTuples
>>> around, but keep using the single large array, just logically divided into
>>> two. So the bookkeeping isn't trivial, but seems doable.
>>
>> You're talking about a new thing here, that happens when it is
>> necessary to dump everything and do a conventional merging of on-tape
>> runs. IOW, we cannot fit a significant fraction of overall tuples in
>> memory, and we need much of the memtuples array for the next run
>> (usually this ends as a TSS_FINALMERGE). That being the case
>> (...right?),
>
> I don't think that's what Heikki is talking about. He can correct me
> if I'm wrong, but what I think he's saying is that we should try to
> exploit the fact that we've already determined which in-memory tuples
> can be part of the current run and which in-memory tuples must become
> part of the next run. Suppose half the tuples in memory can become
> part of the current run and the other half must become part of the
> next run. If we throw all of those tuples into a single bucket and
> quicksort it, we're losing the benefit of the comparisons we did to
> figure out which tuples go in which runs. Instead, we could quicksort
> the current-run tuples and, separately, quick-sort the next-run
> tuples. Ignore the current-run tuples completely until the tape is
> empty, and then merge them with any next-run tuples that remain.

Yeah, something like that. To paraphrase, if I'm now understanding it
correctly, Peter's idea is:

When all the tuples have been fed to tuplesort, and it's time to perform
the sort, quicksort all the tuples currently in the heap, ignoring the
run numbers, and turn the resulting array into another tape. That tape
is special: it's actually stored completely in memory. It is merged with
the "real" tapes when tuples are returned from the tuplesort, just like
regular tapes in TSS_FINALMERGE.

And my idea is:

When all the tuples have been fed to tuplesort, and it's time to perform
the sort, take all the tuples in the heap belonging to currentRun,
quicksort them, and make them part of the current tape. They're not just
pushed to the tape as usual, however, but attached as in-memory tail of
the current tape. The logical tape abstraction will return them after
all the tuples already in the tape, as if they were pushed to the tape
as usual. Then take all the remaining tuples in the heap (if any),
belonging to next tape, and do the same for them. They become an
in-memory tail of the next tape.

> I'm not sure if there's any reason to believe that would be faster
> than your approach. In general, sorting is O(n lg n) so sorting two
> arrays that are each half as large figures to be slightly faster than
> sorting one big array. But the difference may not amount to much.

Yeah, I don't think there's a big performance difference between the two
approaches. I'm not wedded to either approach. Whichever approach we
use, my main point was that it would be better to handle this in the
logical tape abstraction. In my approach, you would have those
"in-memory tails" attached to the last two tapes. In Peter's approach,
you would have one tape that's completely in memory, backed by the
array. In either case, the tapes would look like normal tapes to most of
tuplesort.c. There would be no extra TSS state, it would be
TSS_SORTEDONTAPE or TSS_FINALMERGE as usual.

The logical tape abstraction is currently too low-level for that. It's
just a tape of bytes, and tuplesort.c determines where a tuple begins
and ends. That needs to be changed so that the logical tape abstraction
works tuple-at-a-time instead. For example, instead of
LogicalTapeRead(N) which reads N bytes, you would have
LogicalTapeReadTuple(), which reads next tuple, and returns its length
and the tuple itself. But that would be quite sensible anyway.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2015-08-04 08:27:21 GROUP BY before JOIN
Previous Message Michael Paquier 2015-08-04 07:20:39 Re: WIP: SCRAM authentication