Quick Links

Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeremy Harris <jgh(at)wizmail(dot)org>
Subject:	Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Date:	2015-08-03 20:36:11
Message-ID:	CA+TgmoYJYn1Y68=7UUZLYR8_ayWajd6VrHRY+Jno_SBNWCh_SQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> When it's time to drain the heap, in performsort, divide the array into two
>> arrays, based on the run number of each tuple, and then quicksort the arrays
>> separately. The first array becomes the in-memory tail of the current tape,
>> and the second array becomes the in-memory tail of the next tape.
>>
>> You wouldn't want to actually allocate two arrays and copy SortTuples
>> around, but keep using the single large array, just logically divided into
>> two. So the bookkeeping isn't trivial, but seems doable.
>
> You're talking about a new thing here, that happens when it is
> necessary to dump everything and do a conventional merging of on-tape
> runs. IOW, we cannot fit a significant fraction of overall tuples in
> memory, and we need much of the memtuples array for the next run
> (usually this ends as a TSS_FINALMERGE). That being the case
> (...right?),

I don't think that's what Heikki is talking about. He can correct me
if I'm wrong, but what I think he's saying is that we should try to
exploit the fact that we've already determined which in-memory tuples
can be part of the current run and which in-memory tuples must become
part of the next run. Suppose half the tuples in memory can become
part of the current run and the other half must become part of the
next run. If we throw all of those tuples into a single bucket and
quicksort it, we're losing the benefit of the comparisons we did to
figure out which tuples go in which runs. Instead, we could quicksort
the current-run tuples and, separately, quick-sort the next-run
tuples. Ignore the current-run tuples completely until the tape is
empty, and then merge them with any next-run tuples that remain.

I'm not sure if there's any reason to believe that would be faster
than your approach. In general, sorting is O(n lg n) so sorting two
arrays that are each half as large figures to be slightly faster than
sorting one big array. But the difference may not amount to much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" at 2015-08-03 19:33:15 from Peter Geoghegan

Responses

Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" at 2015-08-03 21:32:40 from Peter Geoghegan
Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort" at 2015-08-04 08:24:20 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jim Nasby	2015-08-03 20:55:06	Re: Minimum tuple threshold to decide last pass of VACUUM
Previous Message	Robert Haas	2015-08-03 20:18:12	Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"