Re: Sorting Improvements for 8.4

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sorting Improvements for 8.4
Date: 2007-12-03 22:54:18
Message-ID: 1196722458.22428.492.camel@dogma.ljc.laika.com
Lists: pgsql-hackers

On Mon, 2007-12-03 at 20:40 +0000, Gregory Stark wrote:
> I'm not sure where the idea of keeping the current bounds of the input tapes
> comes into it. We only preread when we run out of tuples anyways and then we
> don't really have a choice about which tape we want to preread from. And it's
> a good thing too since maintaining such a list of bounds and finding the
> lowest or highest would mean maintaining a second heap which would basically
> double the cpu cost of sorting.
>

You're only keeping track of the maximum value for each run, which
should be cheap to do. The only time that value changes is when you
preread more data from that run, and because each run is sorted, it can
only increase.
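
To make that concrete, here is a minimal sketch of the bookkeeping
involved. This is not the actual tuplesort.c code; RunState and
preread_block are hypothetical names, and an in-memory array of sorted
keys stands in for a tape's blocks:

/* Hypothetical per-run state; not the real tuplesort.c structures. */
typedef struct RunState
{
    const int *keys;         /* the run's keys, sorted (stand-in for tape blocks) */
    int        nkeys;        /* total keys in this run */
    int        next;         /* index of the first key not yet preread */
    int        max_buffered; /* max key among the blocks preread so far */
} RunState;

#define KEYS_PER_BLOCK 4     /* keys per simulated block */

/*
 * Preread one more "block" from a run (assumes the run is not yet
 * exhausted). Because the run is sorted, the last key of the new block
 * is the run's new maximum, so the tracked value can only increase,
 * and only when we preread from this particular run.
 */
static void
preread_block(RunState *run)
{
    int end = run->next + KEYS_PER_BLOCK;

    if (end > run->nkeys)
        end = run->nkeys;
    run->max_buffered = run->keys[end - 1];
    run->next = end;
}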

The tradeoff happening right now is this: we want to merge many runs at
once because that reduces the number of merge phases, but it also
increases seeking, because we read one block from one run, then one
block from another run, and so on, especially if the input is random.

If we reduce the number of runs being merged at once, then we can
preread more efficiently from each one. See:

tuplesort.c:

* as sorted runs, we can eliminate any repeated I/O at all. In the current
* code we determine the number of tapes M on the basis of workMem: we want
* workMem/M to be large enough that we read a fair amount of data each time
* we preread from a tape, so as to maintain the locality of access described
* above. Nonetheless, with large workMem we can have many tapes.

So, for workMem/M to be "large enough", M has to be small enough. For
example (numbers for illustration only), with workMem = 256MB, M = 512
tapes leaves only 512kB per preread, while M = 32 allows 8MB per
preread. But a small M means we have to do more merge phases, which is
expensive.

Forecasting improves this tradeoff. With forecasting, we no longer
preread _blindly_ from each tape; we use information we already have
(the max value of the last block read from each run) to determine the
runs from which we will need tuples the soonest.

Then, we preread the _correct_ data.
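
To spell that out (a sketch continuing the hypothetical RunState above,
not the actual implementation): the run whose buffered maximum is
smallest is the one the merge will drain first, so that is the run to
preread from next.

/*
 * Forecast which run to preread from next. The merge consumes keys in
 * ascending order, so the run with the smallest buffered maximum is
 * the one whose buffered tuples will be used up soonest. Assumes each
 * run has been preread at least once; returns -1 when every run is
 * fully read.
 */
static int
choose_run_to_preread(const RunState *runs, int nruns)
{
    int best = -1;
    int i;

    for (i = 0; i < nruns; i++)
    {
        if (runs[i].next >= runs[i].nkeys)
            continue;        /* nothing left on this run's tape */
        if (best < 0 || runs[i].max_buffered < runs[best].max_buffered)
            best = i;
    }
    return best;
}

Note that this scan is O(M) once per block preread, not once per tuple,
so it doesn't amount to the second heap you were worried about; the
maxima are only consulted when we actually go to read.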

Regards,
Jeff Davis
