Re: [PERFORM] A Better External Sort?

From: Ron Peacetree <rjpeace(at)earthlink(dot)net>
To: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-28 16:03:34
Message-ID: 21402654.1127923414088.JavaMail.root@elwamui-polski.atl.sa.earthlink.net
Lists: pgsql-hackers pgsql-performance

>From: "Jeffrey W. Baker" <jwbaker(at)acm(dot)org>
>Sent: Sep 27, 2005 1:26 PM
>To: Ron Peacetree <rjpeace(at)earthlink(dot)net>
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
>
>>That Btree can be used to generate a physical reordering of the data
>>in one pass, but that's the weakest use for it. The more powerful
>>uses involve allowing the Btree to persist and using it for more
>>efficient re-searches or combining it with other such Btrees (either as
>>a step in task distribution across multiple CPUs or as a more efficient
>>way to do things like joins by manipulating these Btrees rather than
>>the actual records.)
>
>Maybe you could describe some concrete use cases. I can see what
>you are getting at, and I can imagine some advantageous uses, but
>I'd like to know what you are thinking.
>
1= In a 4P box, we split the data in RAM into 4 regions and, for each CPU,
create a CPU-cache-friendly Btree over its region using the method I described.
The 4 Btrees can be merged in a more time- and space-efficient manner
than the original records to form a Btree that represents the sorted order
of the entire data set. Any of these Btrees can be allowed to persist to
lower the cost of doing similar operations in the future (updating the
Btrees during inserts and deletes is cheaper than updating the original
data files and then redoing the same sort from scratch later).
Both the original sort and future such sorts are made more efficient
than current methods.
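
Here's a rough sketch in C of the two phases for point 1. It is purely
illustrative: plain sorted arrays of Key+RID entries stand in for the
cache-friendly Btrees, the partition count and the key/RID types are
assumptions made up for the example, and the per-CPU sorts are shown
serially rather than running on 4 separate CPUs:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Key + RID entry; the real structures would be Btree nodes, not arrays. */
typedef struct { uint32_t key; uint64_t rid; } Entry;

static int cmp_entry(const void *a, const void *b)
{
    const Entry *x = a, *y = b;
    if (x->key != y->key)
        return (x->key < y->key) ? -1 : 1;
    return (x->rid < y->rid) ? -1 : (x->rid > y->rid);
}

#define NPART 4   /* one region per CPU in the 4P example */

/* Phase 1: each CPU sorts its own region independently. */
static void sort_partitions(Entry *part[], size_t n[])
{
    for (int p = 0; p < NPART; p++)
        qsort(part[p], n[p], sizeof(Entry), cmp_entry);
}

/* Phase 2: merge the 4 sorted runs into one ordering of the whole set. */
static void merge_partitions(Entry *part[], size_t n[], Entry *out)
{
    size_t pos[NPART] = {0};
    size_t total = 0;

    for (int p = 0; p < NPART; p++)
        total += n[p];

    for (size_t i = 0; i < total; i++)
    {
        int best = -1;

        for (int p = 0; p < NPART; p++)
            if (pos[p] < n[p] &&
                (best < 0 ||
                 cmp_entry(&part[p][pos[p]], &part[best][pos[best]]) < 0))
                best = p;
        out[i] = part[best][pos[best]++];
    }
}

int main(void)
{
    Entry a[] = {{3, 1}, {1, 2}}, b[] = {{2, 3}, {5, 4}},
          c[] = {{4, 5}, {0, 6}}, d[] = {{6, 7}, {2, 8}};
    Entry *part[NPART] = {a, b, c, d};
    size_t n[NPART] = {2, 2, 2, 2};
    Entry out[8];

    sort_partitions(part, n);
    merge_partitions(part, n, out);

    for (size_t i = 0; i < 8; i++)
        printf("key=%u rid=%llu\n",
               (unsigned) out[i].key, (unsigned long long) out[i].rid);
    return 0;
}

The point to notice is that only the small Key+RID entries are ever
compared or moved; the original records stay untouched during the merge.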

2= We use my method to sort two different tables. We now have these
very efficient representations of a specific ordering on each table. A
join operation can now be done using these Btrees rather than the
original data tables, with less overhead than many current methods.
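
A sketch of point 2, again illustrative only: given the two Key+RID
orderings (one per table), a merge join can pair up matching RIDs without
touching either base table until the matched tuples actually need to be
fetched. The Entry layout and the printf standing in for "emit the pair"
are assumptions of the example, not anything from the PostgreSQL code:

#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t key; uint64_t rid; } Entry;

/* r[] and s[] are the Key+RID orderings of the two tables, already sorted
 * by key; emit every RID pairing whose keys match. */
static void merge_join(const Entry *r, size_t nr, const Entry *s, size_t ns)
{
    size_t i = 0, j = 0;

    while (i < nr && j < ns)
    {
        if (r[i].key < s[j].key)
            i++;
        else if (r[i].key > s[j].key)
            j++;
        else
        {
            uint32_t k = r[i].key;
            size_t j0 = j;

            /* pair every r entry in this key group with every s entry */
            for (; i < nr && r[i].key == k; i++)
                for (j = j0; j < ns && s[j].key == k; j++)
                    printf("match: rid %llu <-> rid %llu\n",
                           (unsigned long long) r[i].rid,
                           (unsigned long long) s[j].rid);
        }
    }
}

int main(void)
{
    Entry r[] = {{1, 10}, {2, 11}, {2, 12}, {5, 13}};
    Entry s[] = {{2, 20}, {2, 21}, {3, 22}, {5, 23}};

    merge_join(r, 4, s, 4);
    return 0;
}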

3= We have multiple such Btrees for the same data set representing
sorts done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of
those Keys is now cheaper than doing the sort based on the composite
Key from scratch. When some of the Btrees exist and some of them
do not, there is a tradeoff calculation to be made. Sometimes it will be
cheaper to do the sort from scratch using the composite Key.
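
To illustrate point 3 (again just a sketch with made-up types): if the
entries are already in Key A order, as the existing Btree on A would
deliver them, then the composite (A, B) order only requires a sort
*within* each run of equal A values, which is cheaper than re-sorting
the entire set on (A, B) from scratch:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct { uint32_t a; uint32_t b; uint64_t rid; } ABEntry;

static int cmp_b(const void *x, const void *y)
{
    const ABEntry *p = x, *q = y;
    return (p->b < q->b) ? -1 : (p->b > q->b);
}

/* e[] arrives sorted on .a; on return it is sorted on (.a, .b). */
static void refine_to_composite_order(ABEntry *e, size_t n)
{
    size_t start = 0;

    while (start < n)
    {
        size_t end = start + 1;

        while (end < n && e[end].a == e[start].a)
            end++;
        qsort(e + start, end - start, sizeof(ABEntry), cmp_b);
        start = end;
    }
}

int main(void)
{
    ABEntry e[] = {{1, 9, 100}, {1, 2, 101}, {2, 5, 102}, {2, 1, 103}, {2, 3, 104}};

    refine_to_composite_order(e, 5);
    for (size_t i = 0; i < 5; i++)
        printf("a=%u b=%u rid=%llu\n",
               (unsigned) e[i].a, (unsigned) e[i].b, (unsigned long long) e[i].rid);
    return 0;
}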

>Specifically I'd like to see some cases where this would beat sequential
>scan. I'm thinking that in your example of a terabyte table with a
>column having only two values, all the queries I can think of would be
>better served with a sequential scan.
>
In my original example, a sequential scan of the 1TB of 2KB or 4KB
records (=> 500M or 250M records of data, respectively), sorted on a
binary-valued key, will take ~1000x more time than reading in the ~1GB
Btree I described that uses a Key+RID (plus node pointers) representation
of the data.
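
A back-of-envelope check of that ratio, using only the sizes already
given (1TB of table data vs ~1GB of Btree). The sequential bandwidth
figure is an arbitrary assumption and cancels out of the ratio:

#include <stdio.h>

int main(void)
{
    const double table_bytes = 1e12;      /* 1TB of 2KB-4KB records        */
    const double btree_bytes = 1e9;       /* ~1GB Key+RID (+node pointers) */
    const double bytes_per_s = 200e6;     /* assumed sequential bandwidth  */

    double table_scan_s = table_bytes / bytes_per_s;   /* ~5000 s */
    double btree_scan_s = btree_bytes / bytes_per_s;   /* ~5 s    */

    printf("full table scan: %.0f s\n", table_scan_s);
    printf("btree scan:      %.0f s\n", btree_scan_s);
    printf("ratio:           %.0fx\n", table_scan_s / btree_scan_s);
    return 0;
}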

Just to clarify the point further:
1TB of 1-byte records => 2^40 records with at most 256 distinct values.
1TB of 2-byte records => 2^39 records with at most 2^16 distinct values.
1TB of 4-byte records => 2^38 records with at most 2^32 distinct values.
1TB of 5-byte records => ~200 billion records with at most ~200 billion
distinct values.
From here on, the number of possible distinct values is limited by the
number of records.
100-byte records are used in the "Indy" version of Jim Gray's sorting
contests, so 1TB => 10 billion records.
2KB-4KB is the most common record size I've seen in enterprise-class
DBMS (so I used this value to make my initial example more realistic).

Therefore, the vast majority of the time, representing a data set by Key
will use less space than the original records. Less space used means
less IO to scan the data set, which means faster scan times.

This is why index files work in the first place, right?

>Perhaps I believe this because you can now buy as much sequential I/O
>as you want. Random I/O is the only real savings.
>
1= No, you cannot "buy as much sequential IO as you want". Even with
an infinite budget, there are physical and engineering limits. Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains. So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget. The traffic
on these lists alone shows how severe the typical cost constraints are.
OTOH, if you do have an infinite IT budget, care to help a few less fortunate
than yourself? After all, even a large constant subtracted from infinity
is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive
part of the performance equation. The fastest and cheapest IO you can
do is _no_ IO. As long as we trade cheaper RAM and even cheaper CPU
operations for IO correctly, more space-efficient data representations will
always be a win because of this.
