Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2016-02-15 04:01:27
Message-ID: CAM3SWZSmHQG6xk+na47VVzVUGU4CmDp6_ijYd+S=cetk3OEHXA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> I'm not even sure this is necessary. The idea of missing out on
>> producing a single sorted run sounds bad, but in practice, since we
>> normally do the final merge on the fly, there doesn't seem to be any
>> real difference between reading one tape or reading two or three
>> tapes when outputting the final results. There will be the same
>> amount of I/O happening, and a 2-way or 3-way merge should be
>> basically free for most data types.
>
> I basically agree with you, but it seems possible to fix the
> regression (generally misguided though those regressed cases are).
> It's probably easiest to just fix it.

Here is a benchmark on my laptop:

$ pgbench -i -s 500 --unlogged

This results in a ~1GB accounts PK:

postgres=# \di+ pgbench_accounts_pkey
List of relations
─[ RECORD 1 ]──────────────────────
Schema │ public
Name │ pgbench_accounts_pkey
Type │ index
Owner │ pg
Table │ pgbench_accounts
Size │ 1071 MB
Description │

The query I'm testing is: "reindex index pgbench_accounts_pkey;"
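
To reproduce timings like those below, a psql session along these
lines should work (the per-run settings here are my assumptions,
matching the cases described):

postgres=# set maintenance_work_mem = '5MB'; -- then '64MB'
postgres=# \timing on
postgres=# reindex index pgbench_accounts_pkey;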

Now, with a maintenance_work_mem of 5MB, the most recent revision of
my patch takes about 54.2 seconds to complete this, as compared to
master's 44.4 seconds. So, clearly a noticeable regression there of
just under 20%. I did not see a regression with a 5MB
maintenance_work_mem when pgbench scale was 100, though. And, with the
default maintenance_work_mem of 64MB, it's a totally different story
-- my patch takes about 28.3 seconds, whereas master takes 48.5
seconds (i.e. longer than with 5MB). My patch needs a 56-way final
merge in the 64MB maintenance_work_mem case, and 47 distinct merge
steps plus a final on-the-fly merge in the 5MB maintenance_work_mem
case. So, a huge amount of merging, but replacement selection still
hardly pays for itself. In the regressed case for my patch, we finish
sorting *runs* about 15 seconds into a 54.2-second operation -- very
early. So it
isn't "quicksort vs replacement selection", so much as "polyphase
merge vs replacement selection". There is a good reason to think that
we can make progress on fixing that regression by doubling down on the
general strategy of improving cache characteristics, and being
cleverer about memory use during non-final merging, too.
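
Counts like these are visible with the trace_sort developer option
(assuming a build with TRACE_SORT defined, which is the default), in
case anyone wants to verify them -- something like this before the
REINDEX:

postgres=# set trace_sort = on;
postgres=# set client_min_messages = 'log';

The LOG output then reports each run written, each merge step, and the
final merge as they happen.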

I looked at what it would take to make the heap a smaller part of
memtuples, along the lines Robert and I talked about, and I think it's
non-trivial because it needs to make the top of the heap something
other than memtuples[0]. I'd need to change the heap code, which
already has three reasons for existing (replacement selection, merging,
and top-N heaps). I find it really hard to justify the effort, and
especially the risk of adding bugs, for a benefit that there is *scant*
evidence for. My
guess is that the easiest, and most sensible way to fix the ~20%
regression seen here is to introduce batch memory allocation to
non-final merge steps, which is where most of the time was spent. (For
simplicity, batch allocation currently happens only during the final
merge phase, but I could revisit that -- it seems not that hard.)

Now, I accept that the cost model has to go. So, what I think would be
best is if we still added a GUC, like the replacement_sort_mem
suggestion that Robert made. This would be a threshold for using what
is currently called "quicksort with spillover". There'd be no cost
model. Jeff Janes also suggested something like this.
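
To be concrete, the interface would look something like this (the GUC
name follows Robert's suggestion; the value shown is just a
placeholder, not a settled default):

postgres=# set replacement_sort_mem = '16MB';
postgres=# reindex index pgbench_accounts_pkey;

Sorts whose workMem is at or below the threshold would remain eligible
for "quicksort with spillover"; above it, every run would simply be
quicksorted.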

The only regression that I find concerning is the one reported by Jeff
Janes [1]. That didn't even involve a correlation, though, so there's
no reason to think that it would be at all helped by what Robert and I
talked about. It seemed like the patch happened to have the effect of
tickling a pre-existing problem with polyphase merge -- what Jeff
called an "anti-sweetspot". Jeff had a plausible theory for why that
is.

So, what if we try to fix polyphase merge? That would be easier. We
could look at the tape buffer size, and the number of tapes, as well
as memory access patterns. We might even make more fundamental changes
to polyphase merge, since we don't use the more advanced variant that
Knuth describes: he suggests that his algorithm 5.4.3, cascade merge,
has a more efficient distribution of runs. (I think correlation is a
red herring here.)

The bottom line is that there will always be some regression
somewhere. I'm not sure what the guiding principle should be for
deciding when that becomes unacceptable, but you did seem sympathetic
to the idea that
really low work_mem settings (e.g. 4MB) with really big inputs were
not too concerning [2]. I'm emphasizing Jeff's case now because I,
like you [2], am much more worried about maintenance_work_mem default
cases with regressions than anything else, and that *was* such a case.

Like Jeff Janes, I don't care about his other regression of about 5%
[3], which involved a 4MB work_mem + 100 million tuples. The other
case (the one I do care about) was 64MB + 400 million tuples, and was
a much bigger regression, which is suggestive of the unpredictable
nature of problems with polyphase merge scheduling that Jeff talked
about. Maybe we just got unlucky there, but that risk should not blind
us to the fact that overwhelmingly, replacement selection is the wrong
thing.

I'm sorry that I've reversed myself like this, Robert, but I'm just
not seeing a benefit to what we talked about, while I do see a cost.

[1] http://www.postgresql.org/message-id/CAMkU=1zKBOzkX-nqE-kJFFMyNm2hMGYL9AsKDEUHhwXASsJEbg@mail.gmail.com
[2] http://www.postgresql.org/message-id/CA+TgmoZGFt6BAxW9fYOn82VAf1u=V0ZZx3bXMs79phjg_9NYjQ@mail.gmail.com
[3] http://www.postgresql.org/message-id/CAM3SWZTYneCG1oZiPwRU=J6ks+VpRxt2Da1ZMmqFBrd5jaSJSA@mail.gmail.com

--
Peter Geoghegan
