From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: PG-related ACM Article: "The Pathologies of Big Data"
Date: 2009-08-08 00:28:20
Message-ID: 407d949e0908071728y17ed4b31g7b95e2be5fce4924@mail.gmail.com
Lists: pgsql-performance

On Fri, Aug 7, 2009 at 9:17 PM, Josh Kupershmidt <schmiddy(at)gmail(dot)com> wrote:
> Just stumbled across this recent article published in the
> Communications of the ACM:
>
> http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext
>
> The author shares some insights relating to difficulties processing a
> 6.75 billion-row
> table, a dummy table representing census-type data for everyone on earth, in
> Postgres.
>
> I'd really like to replicate the author's experiment, but it's not clear from
> the article what his table definition looks like. He claims to be using a
> 16-byte record to store the several columns he needs for each row, so perhaps
> he's using a user-defined type?

or four integers, or who knows. Postgres's per-row overhead is 24
bytes plus a 4-byte line pointer, so you're talking about 44 bytes
per row. There's per-page overhead and alignment padding on top of
that, but in this case it shouldn't amount to much.
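
The article doesn't say what the table looked like. As a rough
sketch, if the 16-byte record really is four 4-byte integers, you
can check the actual on-disk footprint on a small sample yourself
(table and column names here are my invention, not the author's):

  -- hypothetical stand-in for the article's 16-byte record
  CREATE TABLE census_sample (
      age    integer,
      sex    integer,
      region integer,
      income integer
  );

  -- a million synthetic rows
  INSERT INTO census_sample
  SELECT i % 100, i % 2, i % 1000, i % 100000
  FROM generate_series(1, 1000000) AS i;

  -- average on-disk bytes per row, page overhead included
  SELECT pg_relation_size('census_sample')::numeric / count(*)
  FROM census_sample;

On a stock install that should come out in the mid-40s of bytes
per row once the page headers are averaged in.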

> The author implies with his definition of "big data" that the dataset he
> analyzed is "... too large to be placed in a relational database... ". From
> Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
> run on 6.75 billion rows. This amount of time for the query didn't seem
> surprising to me given how many rows he has to process, but in a recent post
> on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
> ran the same SELECT query in 105 minutes! This would be very impressive (a
> roughly 16-fold improvement over Postgres) if true.

6.75 billion * 44 bytes is 297GB.

Assuming you stick that on a single spindle capable of 100MB/s:

You have: 297GB / (100MB/s)
You want: min
        * 49.5

So a plain sequential scan ought to finish in under an hour, not
28 hours; something's not adding up.
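
If you want to see what scan rate your own setup actually
sustains, a quick check on the sample table sketched above (mine,
not the article's):

  -- time a full sequential scan
  EXPLAIN ANALYZE SELECT count(*) FROM census_sample;

  -- relation size, to turn the elapsed time into MB/s
  SELECT pg_size_pretty(pg_relation_size('census_sample'));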

> One intriguing tidbit I picked up from the article: "in modern systems, as
> demonstrated in the figure, random access to memory is typically slower than
> sequential access to disk." In hindsight, this seems plausible (since modern
> disks can sustain sequential reads at well over 100MB/sec).

Sure, but the slowest PCIe bus can sustain 1GB/s and your memory
bandwidth is probably at least 8GB/s.
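
A crude way to see the gap from inside Postgres, again using the
sample table from above: scan it twice and compare timings, cold
cache versus warm.

  -- the first read comes (mostly) from disk, the second from
  -- shared_buffers / the OS cache; restart Postgres and flush
  -- the OS cache first if you want the first scan genuinely cold
  \timing on
  SELECT count(*) FROM census_sample;
  SELECT count(*) FROM census_sample;

That's sequential reads in both cases, not the random access the
article is talking about, but it shows how much headroom memory
has over the disk.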

--
greg
http://mit.edu/~gsstark/resume.pdf
