From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-11-07 01:08:24
Message-ID: CAM3SWZQtdd=Q+EF1xSZaYG1CiOYQJ7sZFcL08GYqChpJtGnKMg@mail.gmail.com
Lists: pgsql-hackers
On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> I'll start a new thread for this, since my external sorting patch has
> now evolved well past the original "quicksort with spillover"
> idea...although not quite how I anticipated it would. It seems like
> I've reached a good point to get some feedback.
Corey Huinker has once again assisted me with this work, by doing some
benchmarking on an AWS instance of his:
32 cores (c3.8xlarge, I suppose)
MemTotal: 251902912 kB
I believe it had one EBS volume.
This testing included 2 data sets:
* A data set that he happens to have that is representative of his
production use-case. Corey had some complaints about the sort
performance of PostgreSQL, particularly prior to 9.5, and I like to
link any particular performance optimization to an improvement in an
actual production workload, if at all possible.
* A tool that I wrote, that works on top of sortbenchmark.org's
"gensort" [1] data generation tool. It seems reasonable to me to drive
this work in part with a benchmark devised by Jim Gray. He did, after
all, receive a Turing award for his contributions to transaction
processing. I'm certainly a fan of his work. A key practical advantage
of that benchmark is that it has reasonable guarantees about
determinism, making these results relatively easy to recreate
independently.
The modified "gensort" is available from
https://github.com/petergeoghegan/gensort
The Python script postgres_load.py performs bulk loading for Postgres
using COPY FREEZE. It ought to be fairly self-documenting:
$:~/gensort$ ./postgres_load.py --help
usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]
optional arguments:
-h, --help show this help message and exit
-w WORKERS, --workers WORKERS
Number of gensort workers (default: 4)
-m MILLION, --million MILLION
Generate n million tuples (default: 100)
-s, --skew Skew distribution of output keys (default: False)
-l, --logged Use logged PostgreSQL table (default: False)
-c, --collate Use default collation rather than C collation
(default: False)
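For example, generating the 16 billion non-skewed tuples used below
(16,000 million) would look something like this (the worker count here
is just an illustrative choice):
$:~/gensort$ ./postgres_load.py --workers 16 --million 16000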
For this initial report to the list, I'm going to focus on a case
involving 16 billion non-skewed tuples generated using the gensort
tool. I wanted to see how a sort of a ~1TB table (1017GB as reported
by psql, actually) could be improved, as compared to the relatively
small volumes of data (in the multiple gigabyte range) whose sorts
improved so much on my laptop, which has enough memory to avoid
blocking on physical I/O much of the time. How the new approach deals
with hundreds of runs that are actually reasonably sized is also of
interest. This server has a lot of memory and many CPU cores, but it
was kind of underpowered on I/O.
The initial load of 16 billion tuples (with a sortkey that is "C"
locale text) took about 10 hours. My tool supports parallel generation
of COPY format files, but serial performance of that stage isn't
especially fast. Further, in order to support COPY FREEZE, and in
order to ensure perfect determinism, the COPY operations occur
serially, in a single transaction that also creates the table on which
CREATE INDEX is later run.
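Each test then amounted to building an index on the sortkey column,
roughly like this (trace_sort produces the LOG lines quoted below;
treat this as an illustrative sketch rather than the verbatim session):
postgres=# set trace_sort = on;
postgres=# set maintenance_work_mem = '3GB';
postgres=# \timing on
postgres=# create index on sort_test (sortkey);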
Patch, with 3GB maintenance_work_mem:
...
LOG: performsort done (except 411-way final merge): CPU
1017.95s/17615.74u sec elapsed 23910.99 sec
STATEMENT: create index on sort_test (sortkey );
LOG: external sort ended, 54740802 disk blocks used: CPU
2001.81s/31395.96u sec elapsed 41648.05 sec
STATEMENT: create index on sort_test (sortkey );
So just over 11 hours (11:34:08), then. The initial sorting for 411
runs took 06:38:30.99, as you can see.
Master branch:
...
LOG: finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec
elapsed 34409.16 sec
LOG: finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec
elapsed 34580.41 sec
LOG: finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec
elapsed 34750.28 sec
LOG: performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec
LOG: finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec
elapsed 34914.17 sec
LOG: finished writing final run 206 to tape 205: CPU
1243.23s/31564.23u sec elapsed 34963.03 sec
LOG: performsort done (except 206-way final merge): CPU
1243.86s/31570.58u sec elapsed 34974.08 sec
LOG: external sort ended, 54740731 disk blocks used: CPU
2026.98s/48448.13u sec elapsed 55299.24 sec
CREATE INDEX
Time: 55299315.220 ms
So 15:21:39 for master -- the patch is a big improvement over that,
but this was still disappointing given the huge improvements seen on
relatively small cases. The finished index was fairly large, which can
be seen by working back from "total relation size":
postgres=# select pg_size_pretty(pg_total_relation_size('sort_test'));
pg_size_pretty
----------------
1487 GB
(1 row)
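Working back, that's roughly 1487GB - 1017GB, or about 470GB, for the
new index itself. The same figure could also be had more directly with
something like this (shown only as an illustration):
postgres=# select pg_size_pretty(pg_indexes_size('sort_test'));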
I think that this is probably due to the relatively slow I/O on this
server, and because the merge step is more of a bottleneck. As we
increase maintenance_work_mem, we're likely to then suffer from the
lack of explicit asynchronous I/O here. It still helps, but not
dramatically. With maintenance_work_mem = 30GB, the patch is somewhat
faster (there is no reason to think that this would help master at
all, so that was left untested):
...
LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
24910.38 sec
LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
25140.69 sec
LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
elapsed 25234.44 sec
LOG: performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec
LOG: starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed
25499.98 sec
LOG: finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed
25700.43 sec
LOG: finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec
elapsed 25782.93 sec
LOG: performsort done (except 41-way final merge): CPU
1965.43s/20086.28u sec elapsed 25980.80 sec
LOG: external sort ended, 54740909 disk blocks used: CPU
3270.57s/31595.37u sec elapsed 40376.43 sec
CREATE INDEX
Time: 40383174.977 ms
So that takes 11:13:03 in total -- we only managed to shave about 20
minutes off the total time taken, despite a 10x increase in
maintenance_work_mem. Still, at least it gets moderately better rather
than worse; getting worse is certainly what I'd expect from the master
branch. 60GB landed halfway between 3GB and 30GB in terms of
performance, so increasing the setting doesn't continue to help, but,
again, at least things don't get much worse.
Thoughts on these results:
* I'd really like to know the role of I/O here. Better, low-overhead
instrumentation is required to see when and how we are I/O bound. I've
been doing much of that on a more-or-less ad hoc basis so far, using
iotop. I'm looking into a way to usefully graph the I/O activity over
many hours, to correlate with the trace_sort output that I'll also
show (a rough sketch of the kind of thing I mean appears after this
list). I'm open to suggestions on the easiest way of doing that. I
haven't used the "perf" tool for instrumenting I/O at all in the past.
* Parallelism would probably help us here *a lot*.
* As I said, I think we suffer from the lack of asynchronous I/O much
more at this scale. Will need to confirm that theory.
* It seems kind of ill-advised to make run size (which is always in
linear proportion to maintenance_work_mem with this new approach to
sorting -- 411 runs at 3GB versus 41 runs at 30GB above) larger,
because it probably will hurt writing runs more than it will help in
making merging cheaper (perhaps mostly due to the lack of asynchronous
I/O to hide the latency of writes -- Linux might not do so well at
this scale).
* Maybe adding actual I/O bandwidth is the way to go to get a better
picture. I wouldn't be surprised if we were very bottlenecked on I/O
here. Might be worth using many parallel EBS volumes here, for
example.
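As a rough sketch of the I/O instrumentation idea from the first
bullet point (nothing more than an illustration, and assuming the
sysstat tools are installed on the box), device statistics could be
sampled with timestamps for the entire duration of the CREATE INDEX,
then graphed against the elapsed times in the trace_sort output:
$ iostat -dmxt 10 > create_index_io.log &
$ # ...run the CREATE INDEX, then stop sampling...
$ kill %1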
[1] http://sortbenchmark.org/FAQ-2015.html
--
Peter Geoghegan