From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: PG-related ACM Article: "The Pathologies of Big Data"
Date: 2009-08-08 00:28:20
Message-ID: 407d949e0908071728y17ed4b31g7b95e2be5fce4924@mail.gmail.com
Lists: pgsql-performance
On Fri, Aug 7, 2009 at 9:17 PM, Josh Kupershmidt <schmiddy(at)gmail(dot)com> wrote:
> Just stumbled across this recent article published in the
> Communications of the ACM:
>
> http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext
>
> The author shares some insights relating to difficulties processing a
> 6.75 billion-row
> table, a dummy table representing census-type data for everyone on earth, in
> Postgres.
>
> I'd really like to replicate the author's experiment, but it's not clear from
> the article what his table definition looks like. He claims to be using a
> 16-byte record to store the several columns he needs for each row, so perhaps
> he's using a user-defined type?
Or four 4-byte integers; who knows. Postgres's per-row overhead is 24
bytes of tuple header plus a 4-byte line pointer, so you're talking
about roughly 44 bytes per row. There's per-page overhead and alignment
padding on top of that, but in this case it shouldn't amount to much.
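
The article never gives the schema, but here's a sketch of one plausible
definition that fits the 16-byte record the author describes (the table and
column names are my guesses, not anything from the article). Loading a sample
and checking pg_relation_size() should land close to the ~44-bytes-per-row
estimate above:

-- Hypothetical schema: four 4-byte integers = 16 bytes of data per row,
-- plus the 24-byte tuple header and 4-byte line pointer discussed above.
CREATE TABLE census_guess (
    age     integer,
    sex     integer,
    income  integer,
    region  integer
);

-- Rough check: load a million rows and see how big the heap actually is.
INSERT INTO census_guess
SELECT i % 120, i % 2, i % 100000, i % 200
FROM generate_series(1, 1000000) AS i;

SELECT pg_size_pretty(pg_relation_size('census_guess'));
-- expect something on the order of 44MB, i.e. ~44 bytes per row plus page overhead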
> The author implies with his definition of "big data" that the dataset he
> analyzed is "... too large to be placed in a relational database... ". From
> Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
> run on 6.75 billion rows. This amount of time for the query didn't seem
> surprising to me given how many rows he has to process, but in a recent post
> on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
> ran the same SELECT query in 105 minutes! This would be very impressive (a
> 10-fold improvement over Postgres) if true.
6.75 billion * 44 bytes is 297GB.
Assuming you stick that on a single spindle capable of 100MB/s:
You have: 297GB / (100MB/s)
You want: min
* 49.5
So something's not adding up.
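
The same back-of-the-envelope arithmetic can be done in psql (the 44-byte row
and 100MB/s sequential-read figures are the assumptions above, not numbers
from the article):

SELECT 6.75e9 * 44              AS table_bytes,    -- ~297GB of heap
       6.75e9 * 44 / 100e6      AS scan_seconds,   -- at 100MB/s sequential
       6.75e9 * 44 / 100e6 / 60 AS scan_minutes;   -- roughly 49.5 minutes, not 28 hours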
> One intriguing tidbit I picked up from the article: "in modern systems, as
> demonstrated in the figure, random access to memory is typically slower than
> sequential access to disk." In hindsight, this seems plausible (since modern
> disks can sustain sequential reads at well over 100MB/sec).
Sure, but the slowest PCIe bus can sustain 1GB/s and your memory
bandwidth is probably at least 8GB/s.
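
To put rough numbers on that quoted claim (the ~100ns DRAM access latency and
64-byte cache line here are my assumptions, not figures from the article):
fetching one cache line per random access already tops out well below
streaming memory bandwidth, and if each access only uses a few bytes of the
line, the useful rate can indeed fall below a disk's sequential throughput:

SELECT 64 / 100e-9 / 1e6 AS random_cacheline_mb_per_s, -- ~640MB/s, one 64B line per ~100ns
       4  / 100e-9 / 1e6 AS random_4byte_mb_per_s,     -- ~40MB/s of useful data if only 4B are needed
       100               AS sequential_disk_mb_per_s,  -- the spindle figure used above
       8000              AS streaming_memory_mb_per_s; -- the ~8GB/s bandwidth figure above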