Re: Merge algorithms for large numbers of "tapes"

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Stephen Frost" <sfrost(at)snowman(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dann Corbit" <DCorbit(at)connx(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-10 00:04:48
Message-ID: C03602A0.1EE2B%llonergan@greenplum.com
Lists: pgsql-hackers

Stephen,

On 3/9/06 3:48 PM, "Stephen Frost" <sfrost(at)snowman(dot)net> wrote:

> So, if we get a huge performance increase, what's wrong with:
> if [ sqrt(est(total)) <= work_mem ]; then
> two-pass-sort();
> else
> tape-sort();
> fi

I have something similar but less complex in mind.

One observed behavior of the current approach is that increasing work_mem
actually slows external sorting down, because the heapsort embedded in the
tape sort's replacement-selection algorithm is not L2-cache-friendly.
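
As a rough illustration (a minimal sketch of the general technique, not
the actual tuplesort code): the sift-down at the heart of replacement
selection chases child indices 2*i+1 and 2*i+2 across the whole array, so
once the heap outgrows L2, nearly every comparison is a cache miss, whereas
quicksort's partition scans are sequential and prefetch well.

    /* Hypothetical sketch, not PostgreSQL code: the min-heap sift-down
     * used by replacement selection.  The child indices fan out across
     * the whole array, so a heap much larger than L2 misses cache on
     * almost every comparison. */
    static void
    sift_down(int *heap, int n, int i)
    {
        for (;;)
        {
            int left = 2 * i + 1;       /* these jumps defeat the cache */
            int right = left + 1;
            int smallest = i;

            if (left < n && heap[left] < heap[smallest])
                smallest = left;
            if (right < n && heap[right] < heap[smallest])
                smallest = right;
            if (smallest == i)
                break;
            int tmp = heap[i];
            heap[i] = heap[smallest];
            heap[smallest] = tmp;
            i = smallest;
        }
    }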

The simplest algorithm to employ here would be to quicksort chunks of
work_mem to produce the runs, write each run out to a heap file, then
merge them all in one pass, materializing the result if random access is
required.
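
To make that concrete, here is a hedged sketch of the proposed flow, using
plain ints and stdio temp files as stand-ins (real tuple I/O and temp-file
management in PostgreSQL are far more involved):

    #include <stdio.h>
    #include <stdlib.h>

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    /* Phase 1: qsort() work_mem-sized chunks, one temp file per run. */
    static int
    build_runs(FILE *in, FILE **runs, int max_runs, size_t chunk)
    {
        int *buf = malloc(chunk * sizeof(int));
        int nruns = 0;
        size_t n;

        while (nruns < max_runs &&
               (n = fread(buf, sizeof(int), chunk, in)) > 0)
        {
            qsort(buf, n, sizeof(int), cmp_int); /* cache-friendly sort */
            runs[nruns] = tmpfile();
            fwrite(buf, sizeof(int), n, runs[nruns]);
            rewind(runs[nruns]);
            nruns++;
        }
        free(buf);
        return nruns;
    }

    /* Phase 2: single merge pass -- repeatedly emit the smallest head
     * element.  (A real version would use a heap of run heads and
     * block-sized reads, not this linear scan of single values.) */
    static void
    merge_runs(FILE **runs, int nruns, FILE *out)
    {
        int *head = malloc(nruns * sizeof(int));
        int *live = malloc(nruns * sizeof(int));

        for (int i = 0; i < nruns; i++)
            live[i] = (fread(&head[i], sizeof(int), 1, runs[i]) == 1);

        for (;;)
        {
            int best = -1;

            for (int i = 0; i < nruns; i++)
                if (live[i] && (best < 0 || head[i] < head[best]))
                    best = i;
            if (best < 0)
                break;
            fwrite(&head[best], sizeof(int), 1, out);
            live[best] =
                (fread(&head[best], sizeof(int), 1, runs[best]) == 1);
        }
        free(head);
        free(live);
    }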

Granted, seek optimizations are necessary to make the merge pass
efficient, but these are tractable, as evidenced by others (Nyquist) and
by our own internal experiments.
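
For instance (my figures, not from the thread), the usual trick is to
split work_mem evenly into per-run read buffers, so each run is read in
long sequential slabs rather than one block at a time:

    #include <stddef.h>

    /* Illustrative only: per-run read-ahead so the merge pass does a
     * few large sequential reads per run instead of a seek per block.
     * E.g. 256 MB of work_mem over 64 runs = 4 MB of buffer per run. */
    static size_t
    read_ahead_per_run(size_t work_mem_bytes, int nruns)
    {
        return work_mem_bytes / (size_t) nruns;
    }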

The appeal here is simplicity: the current approach switches from a
quicksort to the polyphase tape sort when work_mem is exceeded, which
involves a fairly complex chunk of code right now. In the new approach,
when the sort set exceeds work_mem, we just write the sorted chunk out
and continue.

> If the intent is to remove it and then ask for the default work_mem to
> be increased- I doubt going about it this way would work very well. :)

Yep - the main question to address is whether work_mem is always
sufficient to buffer all the runs for a single merge pass, or, if not,
whether we can degrade to a multi-pass merge gracefully.
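
One way to frame that question (my formulation, not anything from the
thread): if work_mem buys a merge fan-in of F runs per pass, the run
count shrinks by a factor of F each pass, so a single pass suffices
exactly when the number of runs is at most F:

    /* Hedged sketch: merge passes needed, assuming at most 'fanin'
     * (>= 2) runs can be merged at once, where fanin is roughly
     * work_mem divided by the per-run buffer size.  nruns <= fanin
     * gives one pass; otherwise ceil(log_fanin(nruns)) passes. */
    static int
    merge_passes(int nruns, int fanin)
    {
        int passes = 0;

        while (nruns > 1)
        {
            nruns = (nruns + fanin - 1) / fanin; /* one merge pass */
            passes++;
        }
        return passes;
    }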

Tim Kordas here plans to work on this sometime next week using code he's
already written, and I'd expect a pretty quick set of improvements through
this simplified approach.

- Luke
