From: "Jim C. Nasby" <jnasby(at)pervasive(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 01:31:39
Message-ID: 20060309013139.GM45250@pervasive.com
Lists: pgsql-hackers
On Wed, Mar 08, 2006 at 10:49:16AM -0800, Luke Lonergan wrote:
> Jim,
>
> On 3/8/06 9:49 AM, "Jim C. Nasby" <jnasby(at)pervasive(dot)com> wrote:
>
> > On Wed, Mar 08, 2006 at 11:20:50AM -0500, Tom Lane wrote:
>
> >> Not sure that follows. In particular, the entire point of the recent
> >> changes has been to extend the range in which we can use a single merge
> >> pass --- that is, write the data once as N sorted runs, then merge them
> >> in a single read pass. As soon as you have to do an actual merge-back-
> >> to-disk pass, your total I/O volume doubles, so there is definitely a
> >> considerable gain if that can be avoided. And a larger work_mem
> >> translates directly to fewer/longer sorted runs.
> >
> > But do fewer/longer sorted runs translate into not merging back to disk?
> > I thought that was controlled by if we had to be able to rewind the
> > result set.
>
> In the *tape* algorithm, there is an intermediate abstraction in the merging
> called tapes (!) that are used to store intermediate merge results. Simon's
> work implemented more tapes, which asymptotically approaches a single merge
> pass as the number of tapes approaches the number of runs.
>
> The Replacement Selection algorithm generally will produce about 1/2 the
> number of runs that a simpler partial sort algorithm would, and the more
> memory it uses, the fewer runs there are, and with fewer runs, fewer tapes
> are required to avoid more passes on the merge.
>
> This whole tape abstraction is something that I believe is unique to
> Postgres among modern databases, and we have found that by removing it
> entirely along with logtape.c, we remove 2000 lines of useless code that
> only complicates our optimization problem.
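Sketching that out to convince myself: a toy replacement-selection loop might look something like the following (illustrative Python only, nothing resembling the real tuplesort.c). The heap holds one work_mem's worth of records, and a record that can no longer extend the current run gets tagged for the next one:

    import heapq

    def replacement_selection(records, heap_size):
        # Yield sorted runs; on random input each run averages ~2*heap_size,
        # which is where the "about 1/2 the number of runs" figure comes from.
        # Assumes None never appears in the input (used as an end sentinel).
        it = iter(records)
        heap = [(0, r) for _, r in zip(range(heap_size), it)]
        heapq.heapify(heap)
        run, out = 0, []
        while heap:
            r, rec = heapq.heappop(heap)
            if r != run:             # smallest live record starts a new run
                yield out
                run, out = r, []
            out.append(rec)
            nxt = next(it, None)
            if nxt is not None:
                # nxt can extend the current run only if it sorts at or after
                # the record just emitted; otherwise defer it to run + 1.
                heapq.heappush(heap, (run if nxt >= rec else run + 1, nxt))
        if out:
            yield out

On already-sorted input that degenerates nicely into a single run (the best case for the one-pass merge Tom describes); on reverse-sorted input it degrades to runs of exactly heap_size, no better than a plain sort-and-dump.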
Oh, geez, I think I get it now. I was thinking that we did something
like sort a chunk, write it to disk, repeat until all the data was
processed, and then just read the sorted chunks back from disk in order,
switching between files as needed. But of course that would suck
horribly if we were actually using tapes. Like others have said, surely
there's got to be a much better way to go about things on more modern
hardware. If there is, then hopefully the possibility exists of
returning memory to "the pool" when it's not going to be as useful to us
as it would be to a sort that could fit entirely in memory.
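To pin down the strawman I mean — sort chunks, then switch between files
on the read pass — here's a toy sketch (Python purely for illustration;
the chunking and temp-file handling are my own assumptions, not anything
the backend actually does):

    import heapq
    import itertools
    import tempfile

    def external_sort(lines, chunk_size):
        # Sort chunk_size lines at a time, spill each sorted chunk to its
        # own temp file, then stream the final order with a k-way merge
        # that "switches between files as needed".
        chunks = []
        it = iter(lines)
        while True:
            chunk = sorted(itertools.islice(it, chunk_size))
            if not chunk:
                break
            f = tempfile.TemporaryFile(mode="w+")
            f.writelines(line + "\n" for line in chunk)
            f.seek(0)
            chunks.append(f)
        runs = [(line.rstrip("\n") for line in f) for f in chunks]
        return heapq.merge(*runs)

    print(list(external_sort(["fig", "date", "apple", "cherry", "banana"], 2)))
    # -> ['apple', 'banana', 'cherry', 'date', 'fig']

Merging five files or fifty is still a single read pass; the question is
just how many runs you can afford to keep open at once.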
As an example, in my hypothetical algorithm that sorts one chunk at a
time and then bounces between chunks when reading the data back out, it
would probably be better to have fewer, larger chunks than many small
ones. But the difference between 256MB chunks and 1GB chunks probably
wouldn't be that large, certainly not a 4x improvement. So it makes
sense to go with the smaller chunks if that means other sorts would be
able to operate entirely in memory. In an ideal world, this allocation
could even be dynamic, based on what else was happening on the machine.
But I'll take any incremental improvement I can get right now. :) Just
having the ability to set a more aggressive work_mem without worrying
about causing a swap storm would be a huge improvement over the current
situation. Being able to cut back on memory use when we fall back to
disk would be icing on the cake. :)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461