Re: Using quicksort for every external sort run

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-11-09 17:27:19
Message-ID: CADkLM=faYKwOfhDjVUFDVk_JGw1a_X-vqp-eQKJVnj9QQE4rQA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 6, 2015 at 8:08 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> > I'll start a new thread for this, since my external sorting patch has
> > now evolved well past the original "quicksort with spillover"
> > idea...although not quite how I anticipated it would. It seems like
> > I've reached a good point to get some feedback.
>
> Corey Huinker has once again assisted me with this work, by doing some
> benchmarking on an AWS instance of his:
>
> 32 cores (c3.8xlarge, I suppose)
> MemTotal: 251902912 kB
>
> I believe it had one EBS volume.
>
> This testing included 2 data sets:
>
> * A data set that he happens to have that is representative of his
> production use-case. Corey had some complaints about the sort
> performance of PostgreSQL, particularly prior to 9.5, and I like to
> link any particular performance optimization to an improvement in an
> actual production workload, if at all possible.
>
> * A tool that I wrote, which works on top of sortbenchmark.org's
> "gensort" [1] data generation tool. It seems reasonable to me to drive
> this work in part with a benchmark devised by Jim Gray. He did, after
> all, receive a Turing award for his contributions to transaction
> processing. I'm certainly a fan of his work. A key practical advantage
> of the tool is that it has reasonable guarantees about determinism,
> making these results relatively easy to recreate independently.
>
> The modified "gensort" is available from
> https://github.com/petergeoghegan/gensort
>
> The Python script postgres_load.py performs bulk-loading for Postgres
> using COPY FREEZE. It ought to be fairly self-documenting:
>
> $:~/gensort$ ./postgres_load.py --help
> usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]
>
> optional arguments:
>   -h, --help            show this help message and exit
>   -w WORKERS, --workers WORKERS
>                         Number of gensort workers (default: 4)
>   -m MILLION, --million MILLION
>                         Generate n million tuples (default: 100)
>   -s, --skew            Skew distribution of output keys (default: False)
>   -l, --logged          Use logged PostgreSQL table (default: False)
>   -c, --collate         Use default collation rather than C collation
>                         (default: False)
>
> For this initial report to the list, I'm going to focus on a case
> involving 16 billion non-skewed tuples generated using the gensort
> tool. I wanted to see how much a sort of a ~1TB table (1017GB as
> reported by psql, actually) could be improved, as compared to the
> relatively small volumes of data (in the multiple-gigabyte range)
> whose sorts improved so much on my laptop, which has enough memory to
> avoid blocking on physical I/O much of the time. How the new approach
> deals with
> hundreds of runs that are actually reasonably sized is also of
> interest. This server does have a lot of memory, and many CPU cores.
> It was kind of underpowered on I/O, though.
>
> The initial load of 16 billion tuples (with a sortkey that is "C"
> locale text) took about 10 hours. My tool supports parallel generation
> of COPY format files, but serial performance of that stage isn't
> especially fast. Further, in order to support COPY FREEZE, and in
> order to ensure perfect determinism, the COPY operations occur
> serially in a single transaction that also creates the table on which
> CREATE INDEX is later run.
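
If I'm reading the load step right, the script presumably does something
equivalent to the following under the hood (a sketch only -- the payload
column and the file names are placeholders, not the real schema; COPY
FREEZE requires that the table be created, or truncated, in the same
transaction as the COPY):

    BEGIN;
    CREATE TABLE sort_test (sortkey text, payload text);  -- placeholder layout
    COPY sort_test FROM '/path/to/gensort_chunk_0001.copy' WITH (FREEZE);
    COPY sort_test FROM '/path/to/gensort_chunk_0002.copy' WITH (FREEZE);
    -- ... one COPY per generated file, all within this one transaction ...
    COMMIT;
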
>
> Patch, with 3GB maintenance_work_mem:
>
> ...
> LOG: performsort done (except 411-way final merge): CPU
> 1017.95s/17615.74u sec elapsed 23910.99 sec
> STATEMENT: create index on sort_test (sortkey );
> LOG: external sort ended, 54740802 disk blocks used: CPU
> 2001.81s/31395.96u sec elapsed 41648.05 sec
> STATEMENT: create index on sort_test (sortkey );
>
> So just over 11 hours (11:34:08), then. The initial sorting for 411
> runs took 06:38:30.99, as you can see.
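
(For anyone following along: the LOG lines above are what trace_sort
emits. A session along these lines produces that kind of output -- a
sketch only, using the settings described above and the table from the
load step; I believe trace_sort is available in standard builds:)

    SET maintenance_work_mem = '3GB';
    SET trace_sort = on;             -- emits the "run" / "performsort" LOG lines
    SET client_min_messages = log;   -- optionally show them in the session too
    CREATE INDEX ON sort_test (sortkey);
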
>
> Master branch:
>
> ...
> LOG: finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec
> elapsed 34409.16 sec
> LOG: finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec
> elapsed 34580.41 sec
> LOG: finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec
> elapsed 34750.28 sec
> LOG: performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec
> LOG: finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec
> elapsed 34914.17 sec
> LOG: finished writing final run 206 to tape 205: CPU
> 1243.23s/31564.23u sec elapsed 34963.03 sec
> LOG: performsort done (except 206-way final merge): CPU
> 1243.86s/31570.58u sec elapsed 34974.08 sec
> LOG: external sort ended, 54740731 disk blocks used: CPU
> 2026.98s/48448.13u sec elapsed 55299.24 sec
> CREATE INDEX
> Time: 55299315.220 ms
>
> So 15:21:39 for master -- the patch is a significant improvement, but
> this was still disappointing given the huge improvements seen on
> relatively small cases.
>
> The finished index was fairly large, which can be seen by working
> back from "total relation size":
>
> postgres=# select pg_size_pretty(pg_total_relation_size('sort_test'));
> pg_size_pretty
> ----------------
> 1487 GB
> (1 row)
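
Working back from that: with the heap itself at ~1017GB, the index alone
comes to roughly 470GB. It can also be read off directly (a sketch; the
second query assumes the default index name that CREATE INDEX would pick):

    -- Size of all indexes on the table, excluding the heap:
    SELECT pg_size_pretty(pg_indexes_size('sort_test'));
    -- Or the single index by name:
    SELECT pg_size_pretty(pg_relation_size('sort_test_sortkey_idx'));
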
>
> I think this limited improvement is probably due to the relatively
> slow I/O on this server, and to the merge step being more of a
> bottleneck. As we increase maintenance_work_mem, we're likely to
> suffer more from the lack of explicit asynchronous I/O here.
> Increasing it still helps, but not dramatically. With
> maintenance_work_mem = 30GB, the patch is somewhat faster (there is no
> reason to think that this would help master at all, so that was not
> tested):
>
> ...
> LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed
> 24910.38 sec
> LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed
> 25140.69 sec
> LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec
> elapsed 25234.44 sec
> LOG: performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec
> LOG: starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed
> 25499.98 sec
> LOG: finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed
> 25700.43 sec
> LOG: finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec
> elapsed 25782.93 sec
> LOG: performsort done (except 41-way final merge): CPU
> 1965.43s/20086.28u sec elapsed 25980.80 sec
> LOG: external sort ended, 54740909 disk blocks used: CPU
> 3270.57s/31595.37u sec elapsed 40376.43 sec
> CREATE INDEX
> Time: 40383174.977 ms
>
> So that takes 11:13:03 in total -- we only managed to shave about 20
> minutes off the total time taken, despite a 10x increase in
> maintenance_work_mem. Still, at least it gets moderately better rather
> than worse; getting worse is certainly what I'd expect from the master
> branch. A 60GB setting landed about half way between 3GB and 30GB in
> terms of performance, so raising it further doesn't continue to help,
> but, again, at least things don't get much worse.
>
> Thoughts on these results:
>
> * I'd really like to know the role of I/O here. Better, low-overhead
> instrumentation is required to see when and how we are I/O bound. I've
> been doing much of that on a more-or-less ad hoc basis so far, using
> iotop. I'm looking into a way to usefully graph the I/O activity over
> many hours, to correlate with the trace_sort output that I'll also
> show. I'm open to suggestions on the easiest way of doing that; I
> haven't used the "perf" tool for instrumenting I/O at all in the past.
>
> * Parallelism would probably help us here *a lot*.
>
> * As I said, I think we suffer from the lack of asynchronous I/O much
> more at this scale. Will need to confirm that theory.
>
> * It seems kind of ill-advised to make run size (which is always in
> linear proportion to maintenance_work_mem with this new approach to
> sorting) larger, because it probably will hurt writing runs more than
> it will help in making merging cheaper (perhaps mostly due to the lack
> of asynchronous I/O to hide the latency of writes -- Linux might not
> do so well at this scale).
>
> * Maybe adding actual I/O bandwidth is the way to go to get a better
> picture. I wouldn't be surprised if we were very bottlenecked on I/O
> here. Might be worth using many parallel EBS volumes here, for
> example.
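
On the run-size point above: the numbers in this thread do line up with
the claim that run size (and so run count) tracks maintenance_work_mem
roughly linearly under the patch -- a quick check, just restating figures
already quoted above:

    -- 3GB -> 411 runs, 30GB -> 41 runs: 10x the memory gave ~10x fewer runs
    SELECT 30.0 / 3.0  AS mem_ratio,
           411 / 41.0  AS run_count_ratio;   -- ~10.02
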
>
> [1] http://sortbenchmark.org/FAQ-2015.html
> --
> Peter Geoghegan
>

The machine in question still exists, so if you have questions about it,
or commands you'd like me to run to give you insight into its I/O
capabilities, let me know. I can't guarantee we'll keep the machine much
longer.
