From: | Peter Geoghegan <pg(at)heroku(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu> |
Subject: | Re: Using quicksort for every external sort run |
Date: | 2016-04-02 00:29:03 |
Message-ID: | CAM3SWZQe6nELBMw64Wwzr3zDSO4G-ZJOtao0-qODyCQGOFJuLw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Feb 4, 2016 at 3:14 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]:
This paper is available from
http://www.vldb.org/journal/VLDBJ4/P603.pdf (previously link is now
dead)
> The paper also has very good analysis of the economics of sorting:
>
> "Even for surprisingly large sorts, it is economical to perform the
> sort in one pass."
I suggest taking a look at "Figure 2. Replacement-selection sort vs.
QuickSort" in the paper. It confirms what I said recently about cache
size. The diagram is annotated: "The tournament tree of
replacement-selection sort at left has bad cache behavior, unless the
entire tournament fits in cache". I think we're well justified in
giving no weight at all to cases where the *entire* tournament tree
(heap) fits in cache, because it's not economical to use a
cpu-cache-sized work_mem setting. It simply makes no sense.
I understand the reluctance to give up on replacement selection. The
authors of this paper were themselves reluctant to do so. As they put
it:
"""
We were reluctant to abandon replacement-selection sort, because it has
stability and it generates long runs. Our first approach was to improve
replacement-selection sort's cache locality. Standard replacement-selection
sort has terrible cache behavior, unless the tournament fits in cache. The
cache thrashes on the bottom levels of the tournament. If you think of the
tournament as a tree, each replacement-selection step traverses a path from a
pseudo-random leaf of the tree to the root. The upper parts of the tree may be
cache resident, but the bulk of the tree is not.
We investigated a replacement-selection sort that clusters tournament nodes so
that most parent-child node pairs are contained in the same cache line. This
technique reduces cache misses by a factor of two or three. Nevertheless,
replacement-selection sort is still less attractive than QuickSort because:
1. The cache behavior demonstrates less locality than QuickSorts. Even when
QuickSort runs did not fit entirely in cache, the average compare-exchange
time did not increase significantly.
2. Tournament sort is more CPU-intensive than QuickSort. Knuth calculated a 2:1
ratio for the programs he wrote. We observed a 2.5:1 speed advantage for
QuickSort over the best tournament sort we wrote.
The key to achieving high execution speeds on fast processors is to minimize
the number of references that cannot be serviced by the on-board cache (4MB in
the case of the DEC 7000 AXP). As mentioned before, QuickSort's memory access
patterns are sequential and, thus, have good cache behavior
"""
This paper is co-authored by Jim Gray, a Turing award laureate, as
well as some other very notable researchers. The paper appeared in
"Readings in Database Systems, 4th edition", which was edited by by
Joseph Hellerstein and Michael Stonebraker. These days, the cheapest
consumer level CPUs have 4MB caches (in 1994, that was exceptional),
so if this analysis wasn't totally justified in 1994, when the paper
was written, it is today.
I've spent a lot of time analyzing this problem. I've been looking at
external sorting in detail for almost a year now. I've done my best to
avoid any low-end regressions. I am very confident that I cannot do
any better than I already have there, though. If various very
influential figures in the database research community could not do
better, then I have doubts that we can. I started with the intuition
that we should still use replacement selection myself, but that just
isn't well supported by benchmarking cases with sensible
work_mem:cache size ratios.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Thomas Munro | 2016-04-02 01:20:55 | Re: Support for N synchronous standby servers - take 2 |
Previous Message | Michael Paquier | 2016-04-02 00:14:24 | Re: Speedup twophase transactions |