From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-12-07 17:01:53
Message-ID: CAMkU=1zKBOzkX-nqE-kJFFMyNm2hMGYL9AsKDEUHhwXASsJEbg@mail.gmail.com
Lists: pgsql-hackers
On Wed, Dec 2, 2015 at 10:03 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> While large sorts are uncommon in queries, they are much more common
> in index builds. Therefore, I think we ought to be worrying more
> about regressions at 64MB than at 4MB, because we ship with
> maintenance_work_mem = 64MB and a lot of people probably don't change
> it before trying to build an index.
You have more sympathy for people who don't tune their settings than I do.
Especially now that autovacuum_work_mem exists, there is much less
constraint on increasing maintenance_work_mem than there is on work_mem.
Unless, perhaps, you have a lot of user-driven temp tables which get
indexes created on them.
> If we make those index builds go
> faster, users will be happy. If we make them go slower, users will be
> sad. So I think it's worth asking the question "are there any CREATE
> INDEX commands that someone might type on a system on which they've
> done no other configuration that will be slower with this patch"?
I found a regression on my 2nd attempt. I am indexing random md5
hashes (so they should get the full benefit of key abbreviation), and
in this case 400,000,000 of them:
create table foobar as select md5(random()::text) as x, random() as y
from generate_series(1,100000000);
insert into foobar select * from foobar ;
insert into foobar select * from foobar ;
This gives a 29GB table.
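One way to double-check that size, using the foobar table created above (this is just a verification query, not part of the original test):
-- heap size only, not counting any indexes
select pg_size_pretty(pg_table_size('foobar'));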
The index is then built with:
create index on foobar (x);
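For reference, the LOG lines quoted below are trace_sort output; a session setup roughly like the following produces them at the 64MB default (trace_sort is a developer option, available in standard builds, which define TRACE_SORT by default):
set maintenance_work_mem = '64MB';
set client_min_messages = log;  -- send the LOG-level sort traces to the client
set trace_sort = on;            -- emit per-merge-step resource usage
-- then run the CREATE INDEX above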
With 64MB maintenance_work_mem, I get (best time of 7 or 8):
unpatched   2,436,483.834 ms
allpatches  3,964,875.570 ms  (62% slower)
not_0005    3,794,716.331 ms
The unpatched sort ends with a 118-way merge followed by a 233-way merge:
LOG: finished 118-way merge step: CPU 98.65s/835.67u sec elapsed 1270.61 sec
LOG: performsort done (except 233-way final merge): CPU 98.75s/835.88u sec elapsed 1276.14 sec
LOG: external sort ended, 2541465 disk blocks used: CPU 194.02s/1635.12u sec elapsed 2435.46 sec
The patched one ends with a 2-way merge, two sequential 233-way merges, and
a final 233-way merge:
LOG: finished 2-way merge step: CPU 62.08s/435.70u sec elapsed 587.52 sec
LOG: finished 233-way merge step: CPU 77.94s/660.11u sec elapsed 897.51 sec
LOG: a multi-pass external merge sort is required (234 tape maximum)
HINT: Consider increasing the configuration parameter "maintenance_work_mem".
LOG: finished 233-way merge step: CPU 94.55s/884.63u sec elapsed 1185.17 sec
LOG: performsort done (except 233-way final merge): CPU 94.76s/884.69u sec elapsed 1192.01 sec
LOG: external sort ended, 2541656 disk blocks used: CPU 202.65s/1771.50u sec elapsed 3963.90 sec
If you just look at the final merges of each, they should have the
same number of tuples going through them (i.e. all of the tuples), but
the patched one took well over twice as long, and essentially all of
the extra time was IO time, not CPU time.
I reversed out the memory pooling patch, and that shaved some time
off, but nowhere near bringing it back to parity.
I think what is going on here is that the different number of runs
with the patched code just makes it land in an anti-sweet spot in the
tape emulation and buffering algorithm.
Each tape gets 256kB of buffer. But two tapes have one third of the
tuples each, and the other third is spread over all the other tapes
almost equally (or maybe one tape has 2/3 of the tuples, if the output
of one 233-way nonfinal merge was selected as the input of the other
one). Once the large tape(s) has depleted its buffer, the others have
each been depleted by only slightly more than 1kB. Yet when it goes to
refill the large tape, it also tops off every other tape while it is
there, which is not going to get much read-ahead performance on them,
leading to a lot of random IO.
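As a rough check on that arithmetic (assuming the distribution sketched above: one big tape with 1/3 of the tuples, and the 231 small tapes sharing another 1/3, so each small tape feeds the merge at about 1/231 of the big tape's rate):
-- kB drained from each small tape's 256kB buffer by the time the big
-- tape has emptied its whole 256kB buffer
select round(256.0 / 231, 2) as kb_per_small_tape;  -- about 1.11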
Now, I'm not sure why this same logic wouldn't apply to the unpatched
code with its 118-way merge too, so maybe I am all wet here. It seems
like that imbalance would be enough to cause the problem as well.
I have seen this same type of thing years ago, but was never able to
analyze it to my satisfaction (and I haven't been able to now,
either).
So if this patch with this exact workload just happens to land on a
pre-existing infelicity, how big of a deal is that? It wouldn't be
creating a regression, just shoving the region that experiences the
problem around in such a way that it affects a different group of use
cases.
And perhaps more importantly, can anyone else reproduce this, or understand it?
Cheers,
Jeff