Re: PG-related ACM Article: "The Pathologies of Big Data"

From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: PG-related ACM Article: "The Pathologies of Big Data"
Date: 2009-08-08 00:40:55
Message-ID: dcc563d10908071740w26bef446nda22268635da9f7b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Fri, Aug 7, 2009 at 2:17 PM, Josh Kupershmidt<schmiddy(at)gmail(dot)com> wrote:
> Just stumbled across this recent article published in the
> Communications of the ACM:
>
> http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext
>
> The author shares some insights relating to difficulties processing a
> 6.75 billion-row
> table, a dummy table representing census-type data for everyone on earth, in
> Postgres.
>
> I'd really like to replicate the author's experiment, but it's not clear from
> the article what his table definition looks like. He claims to be using a
> 16-byte record to store the several columns he needs for each row, so perhaps
> he's using a user-defined type?
>
> The author implies with his definition of "big data" that the dataset he
> analyzed is "... too large to be placed in a relational database... ". From
> Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
> run on 6.75 billion rows. This amount of time for the query didn't seem
> surprising to me given how many rows he has to process, but in a recent post
> on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
> ran the same SELECT query in 105 minutes! This would be very impressive (a
> 10-fold improvement over Postgres) if true.

Well, from the article, I got the feeling he never showed up here on
the list to ask for help, and he just assumed he knew enough about
postgresql to say it couldn't scale well. I just checked the
archives, and his name doesn't show up.

When you look at his slides, this one makes we wonder about a few points:

http://deliveryimages.acm.org/10.1145/1540000/1536632/figs/f3.jpg

He was using 8 15kSAS in RAID-5. Just the fact that he's using RAID-5
to test makes me wonder, but for his mostly-read workload it's useful.
But on his machine he was only getting 53MB/second sequential reads?
That makes no sense. I was getting 50MB/s from a 4 disk SATA RAID on
older 120G hard drives years ago. SAS drives haven't been around that
long really, so I can't imagine having 7 disks (1 for parity) and only
getting 53/7 or 7.5MB/second from them. That's horrible. I had 9 Gig
5.25 full height drives faster than that back in the day, on eight bit
scsi controllers. His memory read speed was pretty bad too at only
350MB/s. I have a 12 drive RAID-10 that can outrun his memory reads.
So I tend to think his OS was setup poorly, or his hardware was
broken, or something like that.

> The author complained that "on larger tables [Postgres' planner] switched to
> sorting by grouping columns", which he blamed for the slow query execution. I
> don't personally see this plan as a problem, but maybe someone can enlighten
> me.

I'm sure that if he was on faster hardware it might have been quite a
bit faster. I'd love to try his test on a real server with RAID-10
and lots of memory. I'm certain I could get the run time down by a
couple factors.

I wonder if he cranked up work_mem? I wonder if he even upped shared_buffers?

> One intriguing tidbit I picked up from the article: "in modern systems, as
> demonstrated in the figure, random access to memory is typically slower than
> sequential access to disk." In hindsight, this seems plausible (since modern
> disks can sustain sequential reads at well over 100MB/sec).

This is generally always true. But his numbers are off by factors for
a modern system. Pentium IIs could sequentially read in the several
hundreds of megs per second from memory. Any modern piece of kit,
including my laptop, can do much much better than 350Meg/second from
memory.

I wonder if he'd make his work available to mess with, as it seems he
did a pretty poor job setting up his database server / OS for this
test. At the very least I wonder if he has a colleague on this list
who might point him to us so we can try to help him improve the dismal
performance he seems to be getting. Or maybe he could just google
"postgresql performance tuning" and take it from there...

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Scott Marlowe 2009-08-08 00:42:30 Re: PG-related ACM Article: "The Pathologies of Big Data"
Previous Message Greg Stark 2009-08-08 00:28:20 Re: PG-related ACM Article: "The Pathologies of Big Data"