From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Using quicksort for every external sort run
Date: 2016-02-05 17:31:10
Message-ID: CA+TgmoY87y9FuZ=NE7JayH2emtovm9Jp9aLfFWunjF3utq4hfg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 4, 2016 at 6:14 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> The economics of using 4MB or even 20MB to sort 10GB of data are
> already preposterously bad for everyone that runs a database server,
> no matter how budget conscious they may be. I can reluctantly accept
> that we need to still use a heap with very low work_mem settings to
> avoid the risk of a regression (in the event of a strong correlation)
> on general principle, but I'm well justified in proposing "just don't
> do that" as the best practical advice.
>
> I thought I had your agreement on that point, Robert; is that actually the case?

Peter and I spent a few hours talking on Skype this morning about this
point and I believe we have agreed on an algorithm that I think will
address all of my concerns and hopefully also be acceptable to him.
Peter, please weigh in and let me know if I've gotten anything
incorrect here or if you think of other concerns afterwards.

The basic idea is that we will add a new GUC with a name like
replacement_sort_mem that will have a default value in the range of
20-30MB; or possibly we will hardcode this value, but for purposes of
this email I'm going to assume it's a GUC. If the value of work_mem
or maintenance_work_mem, whichever applies, is smaller than the value
of replacement_sort_mem, then the latter has no effect. However, if
replacement_sort_mem is the smaller value, then the amount of memory
that can be used for a heap with replacement selection is limited to
replacement_sort_mem: we can use more memory than that in total for
the sort, but the amount that can be used for a heap is restricted to
that value. The way we do this is explained in more detail below.
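
In code, the cap is just the smaller of the two settings. Here's a
minimal sketch (not code from any patch; the function name is made up
for illustration):

    /*
     * Sketch only: how much memory, in kilobytes, the replacement-selection
     * heap may use, given the sort's overall budget (work_mem or
     * maintenance_work_mem, whichever applies) and the proposed
     * replacement_sort_mem setting.
     */
    static int
    heap_mem_limit(int sort_mem, int replacement_sort_mem)
    {
        if (replacement_sort_mem < sort_mem)
            return replacement_sort_mem;  /* heap capped below the sort budget */
        return sort_mem;                  /* budget is smaller: setting has no effect */
    }
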
One thing I just thought of (after the call) is that it might be
better for this GUC to be in units of tuples rather than in units of
memory; it's not clear to me why the optimal heap size should be
dependent on the tuple size, so we could have a threshold like 300,000
tuples or whatever. But that's a secondary issue and I might be
wrong about it: the point is that in order to have a chance of
winning, a heap used for replacement selection needs to be quite
small by the standards of modern hardware, so the plan is to limit
it to a size at which it can plausibly win.

Here's how that will work, assuming Peter and I understand each other:

1. We start reading the input data. If we reach the end of the input
data before (maintenance_)work_mem is exhausted, then we can simply
quicksort the data and we're done. This is no different than what we
already do today.

2. If (maintenance_)work_mem fills up completely, we will quicksort
all of the data we have in memory. We will then regard the tail end
of that sorted data, in an amount governed by replacement_sort_mem, as
a heap, and use it to perform replacement selection until no tuples
remain for the current run. Meanwhile, the rest of the sorted data
remains in memory untouched. Logically, we're constructing a run of
tuples which is split between memory and disk: the head of the run
(what fits in all of (maintenance_)work_mem except for
replacement_sort_mem) is in memory, and the tail of the run is on
disk.

3. If we reach the end of input before replacement selection runs out
of tuples for the current run, and if it finds no tuples for the next
run prior to that time, then we are done. All of the tuples form a
single run and we can return the tuples in memory first followed by
the tuples on disk. This case is highly likely to be a huge win over
what we have today, because (a) some portion of the tuples were sorted
via quicksort rather than heapsort and that's faster, (b) the tuples
that were sorted using a heap were sorted using a small heap rather
than a big one, and (c) we only wrote out the minimal number of tuples
to tape instead of, as we would have done today, all of them.

4. If we reach this step, then replacement selection with a small heap
wasn't able to sort the input in a single run. We have a bunch of
sorted data in memory which is the head of the same run whose tail is
already on disk; we now spill all of these tuples to disk. That
leaves only the heapified tuples in memory. We just ignore the fact
that they are a heap and treat them as unsorted. We repeatedly do the
following: read tuples until work_mem is full, sort them, and dump the
result to disk as a run. When all runs have been created, we merge
runs just as we do today.
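
To make the control flow concrete, here is a toy, self-contained C
program that simulates steps 1-4 on an array of ints and reports how
many runs the scheme would write. It is purely illustrative and not
code from the actual patch: MEM_TUPLES and HEAP_TUPLES stand in for
(maintenance_)work_mem and replacement_sort_mem (counted in tuples
rather than bytes), the "heap" is just a small sorted buffer, and the
tapes are reduced to counters.

    #include <stdio.h>
    #include <stdlib.h>

    #define MEM_TUPLES  8   /* stands in for (maintenance_)work_mem, in tuples */
    #define HEAP_TUPLES 3   /* stands in for replacement_sort_mem, in tuples */

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a, y = *(const int *) b;

        return (x > y) - (x < y);
    }

    /* Return the number of runs the hybrid scheme would write for this input. */
    static int
    simulate_runs(const int *in, int n)
    {
        int mem[MEM_TUPLES], cur[HEAP_TUPLES];
        int ncur = HEAP_TUPLES, npending = 0, pos, i;

        /* Step 1: read until memory fills or the input ends, then quicksort. */
        int nmem = (n < MEM_TUPLES) ? n : MEM_TUPLES;

        for (i = 0; i < nmem; i++)
            mem[i] = in[i];
        qsort(mem, nmem, sizeof(int), cmp_int);
        if (n <= MEM_TUPLES)
            return 1;           /* everything fit in memory: internal sort, done */

        /*
         * Step 2: the largest HEAP_TUPLES values become the heap (kept here as
         * a small sorted buffer); the smaller values stay in memory as the head
         * of run 1 and aren't tracked further, since we only count runs.
         * Replacement selection emits the smallest heap member to tape and
         * refills the freed slot from the input; an incoming value smaller
         * than the one just emitted must wait for the next run.
         */
        for (i = 0; i < HEAP_TUPLES; i++)
            cur[i] = mem[nmem - HEAP_TUPLES + i];
        pos = MEM_TUPLES;

        while (ncur > 0)
        {
            int out = cur[0];   /* smallest current-run member goes to tape */

            for (i = 1; i < ncur; i++)
                cur[i - 1] = cur[i];
            ncur--;

            if (pos < n)
            {
                int next = in[pos++];

                if (next >= out)
                {
                    cur[ncur++] = next;     /* still fits in the current run */
                    qsort(cur, ncur, sizeof(int), cmp_int);
                }
                else
                    npending++;             /* deferred to the next run */
            }
        }

        /* Step 3: the input ended and nothing was deferred -- a single run. */
        if (npending == 0 && pos >= n)
            return 1;

        /*
         * Step 4: abandon replacement selection.  The deferred values plus any
         * remaining input are sorted MEM_TUPLES at a time, each batch becoming
         * one quicksorted run.
         */
        return 1 + (npending + (n - pos) + MEM_TUPLES - 1) / MEM_TUPLES;
    }

    int
    main(void)
    {
        int sorted[16], reversed[16], i;

        for (i = 0; i < 16; i++)
        {
            sorted[i] = i + 1;      /* presorted input */
            reversed[i] = 16 - i;   /* badly out-of-order input */
        }
        printf("presorted: %d run(s)\n", simulate_runs(sorted, 16));    /* reports 1 */
        printf("reversed:  %d run(s)\n", simulate_runs(reversed, 16));  /* reports 2 */
        return 0;
    }

With these toy sizes, a fully presorted input comes out as a single
run (step 3), while a reversed input falls through to step 4 almost
immediately and is finished with quicksorted runs.
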
This algorithm seems very likely to beat what we do today in
practically all cases. The benchmarking Peter and others have already
done shows that building runs with quicksort rather than replacement
selection can often win even if the larger number of tapes requires a
multi-pass merge. The only cases where it didn't seem to be a clear
win involved data that was already in sorted order, or very close to
it. But with this algorithm, presorted input is fine: we'll quicksort
some of it (which is faster than replacement selection because
quicksort checks for presorted input) and sort the rest with a *small*
heap (which is faster than our current approach of sorting it with a
big heap when the data is already in order). On top of that, we'll
only write out the minimal amount of data to disk rather than all of
it. So we should still win. On the other hand, if the data is out of
order, then we will do only a little bit of replacement selection
before switching over to building runs by quicksorting, which should
also win.

The worst case I was able to think of for this algorithm is an input
stream that is larger than work_mem and almost sorted: the only
exception is that the record that should be exactly in the middle is
all the way at the end. In that case, today's code will use a large
heap and will consequently produce only a single run. The algorithm
above will end up producing two runs, the second containing only that
one tuple. That means we're going to incur the additional cost of a
merge pass. On the other hand, we're also going to have substantial
savings to offset that - the building-runs stage will save by using
quicksort for some of the data and a small heap for the rest. So the
cost to merge the runs will be at least partially, maybe completely,
offset by reduced time spent building them. Furthermore, Peter has
got other improvements in the patch which also make merging faster, so
if we don't buy enough building the runs to completely counterbalance
the cost of the merge, well, we may still win for that reason. Even
if not, this is so much faster overall that a regression in some sort
of constructed worst case isn't really important. I feel that
presorted input is a sufficiently common case that we should try hard
not to regress it - but presorted input with the middle value moved to
the end is not. We need to not be horrible in that case, but there's
absolutely no reason to believe that we will be. We may even be
faster, but we certainly shouldn't be abysmally slower.
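
For what it's worth, feeding an input of exactly that shape to the
toy simulate_runs() sketch above (illustrative values only, not a
benchmark) shows the expected behavior:

    /* 1-7 and 9-16 in order, with 8, the displaced middle value, at the end */
    int worst[] = {1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 8};

    /*
     * With MEM_TUPLES = 8 and HEAP_TUPLES = 3 this reports 2 runs: run 1
     * holds everything except the 8, and run 2 holds only that one tuple.
     */
    printf("displaced middle: %d run(s)\n", simulate_runs(worst, 16));
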
Doing it this way also avoids the need to have a cost model that makes
decisions on how to sort based on the anticipated size of the input.
I'm really very happy about that, because I feel that any such cost
model, no matter how good, is a risk: estimation errors are not
uncommon. Maybe a really sturdy cost model would be OK in the end,
but not needing one is better. We don't need to fear burning a lot of
time on replacement selection, because the heap is small - any
significant amount of out-of-order data will cause us to switch to the
main algorithm, which is building runs by quicksorting. The decision
is made based on the actual data we see rather than any estimate.
There's only one potentially tunable parameter - replacement_sort_mem
- but it probably won't hurt you very much even if it's wrong by a
factor of two - and there's no reason to believe that value is going
to be very different on one machine than another. So this seems like
it should be pretty robust.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company