Re: [PERFORM] A Better External Sort?

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Zeugswetter Andreas DAZ SD" <ZeugswetterA(at)spardat(dot)at>, "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-10-06 17:28:38
Message-ID: BF6AACD6.10C1A%llonergan@greenplum.com
Lists: pgsql-hackers

Andreas,

On 10/6/05 3:56 AM, "Zeugswetter Andreas DAZ SD" <ZeugswetterA(at)spardat(dot)at>
wrote:

> pg relies on the OS readahead (== larger block I/O) to do efficient I/O.
> Basically the pg scan performance should match a dd if=file of=/dev/null
> bs=8k, unless CPU bound.

Which it is. Postgres will currently do a maximum of 120MB/s scan rate on
modern CPUs, no matter how fast the underlying I/O subsystem is. Several
people on this list have confirmed this by posting 100MB/s numbers on much
faster subsystems. Others have posted that Oracle/DB2/etc run at near-disk
rates. IMO, most of the problem is that there are too many small (tuple- or
single-block-granularity) operations in the I/O path, and they need to be
optimized out.
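To make the dd-style baseline concrete, here is a small Python sketch (my
illustration, not anything from Postgres) that measures raw sequential read
throughput in 8 kB blocks -- the rate a CPU-unbound heap scan should
approach. The scratch file and its size are purely illustrative.

```python
import os
import tempfile
import time

BLOCK = 8192  # match dd bs=8k and the default Postgres page size


def seq_read_mb_per_s(path):
    """Read `path` sequentially in 8 kB blocks; return (bytes, MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, total / (1024 * 1024) / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Illustrative 16 MB scratch file; a real test would scan a table's
    # heap file, or simply run dd against it as quoted above.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (16 * 1024 * 1024))
    nbytes, rate = seq_read_mb_per_s(tmp.name)
    print(f"read {nbytes} bytes at {rate:.0f} MB/s")
    os.unlink(tmp.name)
```

Comparing this figure against a `SELECT COUNT(*)` over a table of the same
size on the same device is the quickest way to see how far the scan rate
falls short of the hardware.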

The first step is recognizing that there is a problem worth solving, and it
seems that many have now recognized the issue of poor I/O path optimization
in Postgres. I think a prioritized short list might look like this:

- Sort performance
- The current implementation of a possibly fine algorithm makes very poor
use of the underlying hardware. It needs to be optimized to remove
redundant I/O operations, and possibly (with a different algorithm?) to
make better use of available RAM. This would be good practice for the next
item on the list.
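To illustrate the RAM-versus-passes tradeoff, here is a toy external merge
sort in Python -- a sketch of the general technique, not the Postgres
tapesort code. Run length is bounded by a hypothetical `mem_limit`
parameter (a stand-in for work_mem), and all runs are merged in a single
multiway pass, so more memory means fewer runs and less merge I/O.

```python
import heapq


def external_sort(items, mem_limit):
    """Toy external merge sort: form sorted runs of at most `mem_limit`
    items, then k-way merge all runs in one pass."""
    runs = []
    buf = []
    for x in items:
        buf.append(x)
        if len(buf) >= mem_limit:
            runs.append(sorted(buf))  # a real sort spills this run to disk
            buf = []
    if buf:
        runs.append(sorted(buf))
    # One multiway merge pass over all runs; a real implementation streams
    # each run from disk through a block-sized read buffer per run.
    return list(heapq.merge(*runs)), len(runs)


out, nruns = external_sort([5, 1, 4, 9, 2, 8, 3, 7, 6, 0], mem_limit=3)
# 10 items with mem_limit=3 -> 4 runs, merged into sorted output
```

The point of the sketch: doubling `mem_limit` halves the number of runs,
and keeping the final merge to a single wide pass avoids the redundant
intermediate merge I/O complained about above.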

- Scan performance
- Someone needs to trace the life of a tuple through the executor and
implement better buffering at each stage of the process. There was a start
at this, but it only buffered in one node. Yes, I know we have a shared
memory buffer, but it acts as a final destination; this is about the path
through the executor.
- I suspect that after profiling (with system routines included), we will
find much more tuple-scale work interfering with the optimal flow of
tuples through the executor. The problem is unlikely to be isolated to the
COUNT() aggregate; it is probably present in many other nodes as well.
Redundant allocation/free of memory on a per-tuple basis should be
replaced with more persistent buffers at page or larger granularity.
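The per-tuple overhead can be sketched in miniature (illustrative Python,
not executor code): the two functions below yield identical tuples from a
stream, but the first performs one small read (and one fresh allocation)
per tuple, while the second reads a whole 8 kB page and slices tuples out
of it in memory -- the page-granularity buffering argued for above.

```python
PAGE = 8192  # stand-in for the Postgres page size


def tuples_per_call(f, tuple_size):
    """One read() per tuple: a small operation and a new object per row."""
    out = []
    while True:
        t = f.read(tuple_size)  # per-tuple call in the I/O path
        if len(t) < tuple_size:
            break
        out.append(t)
    return out


def tuples_per_page(f, tuple_size):
    """One read() per page: tuples are sliced from an in-memory page."""
    out = []
    per_page = PAGE - (PAGE % tuple_size)  # whole tuples per page
    while True:
        page = f.read(per_page)
        for i in range(0, len(page) - tuple_size + 1, tuple_size):
            out.append(page[i:i + tuple_size])
        if len(page) < per_page:
            break
    return out
```

Both return the same rows; the difference is roughly a factor of
PAGE/tuple_size fewer calls through the I/O path, which is exactly the
kind of tuple-scale work profiling should expose.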
- Ultimately, after working through the above two issues, we will reach a
point of diminishing returns where async I/O is needed to remove
producer/consumer stalls in a single-threaded executor. This could be
implemented using sliced processes, threading, or aio. Even without async
I/O, though, I believe we should be seeing scan rates of 200-400 MB/s on
modern CPUs for the simple COUNT() aggregation pattern.
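The producer/consumer point can be sketched with a read-ahead thread
(illustrative Python; a stand-in for aio or sliced I/O processes, and the
function names are my own): the producer keeps a bounded queue of blocks
full while the consumer computes, so I/O and CPU work overlap instead of
serializing.

```python
import queue
import threading


def scan_with_readahead(read_block, process, depth=4):
    """Overlap I/O and computation: a producer thread calls read_block()
    (which returns None at EOF) into a bounded queue while this thread
    runs process() on each block; returns the summed results."""
    q = queue.Queue(maxsize=depth)

    def producer():
        while True:
            blk = read_block()
            q.put(blk)
            if blk is None:  # EOF sentinel
                break

    threading.Thread(target=producer, daemon=True).start()
    total = 0
    while True:
        blk = q.get()
        if blk is None:
            break
        total += process(blk)
    return total


blocks = iter([b"aa", b"bbb", b"c"])
n = scan_with_readahead(lambda: next(blocks, None), len)
# n == 6: the consumer never blocks on I/O while blocks remain queued
```

With synchronous I/O the executor alternates "wait for block, process
block"; with read-ahead the two proceed in parallel, which is the gain
async I/O would buy once the per-tuple overheads above are gone.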

FWIW,

- Luke
