Re: Help with bulk read performance

From: Andy Colson <andy@squeakycode.net>
To: Daniel.S.Schaffer@noaa.gov
Cc: pgsql-performance@postgresql.org
Subject: Re: Help with bulk read performance
Date: 2010-12-14 15:27:19
Message-ID: 4D078CD7.7030308@squeakycode.net
Lists: pgsql-performance

On 11/1/2010 9:15 AM, Dan Schaffer wrote:
> Hello
>
> We have an application that needs to do bulk reads of ENTIRE Postgres
> tables very quickly (i.e. select * from table). We have observed that
> such sequential scans run about 20 times slower than observed raw disk
> reads (5 MB/s versus 100 MB/s). Part of the gap is due to the storage
> overhead we have observed in Postgres: in the example below, it takes
> 1 GB to store 350 MB of nominal data. Accounting for that overhead
> alone, we would still expect bulk read rates of about 35 MB/s
> (100 MB/s x 350 MB / 1 GB), not 5 MB/s.
>
> Observations using iostat and top during these bulk reads suggest that
> the queries are CPU bound, not I/O bound. In fact, repeating the
> queries yields similar response times; presumably, if it were an I/O
> issue, the second run would be much faster with the benefit of caching.
>
> We have tried these simple queries using psql, JDBC, pl/java stored
> procedures, and libpq. In all cases the client code ran on the same box
> as the server.
> We have experimented with Postgres 8.1, 8.3 and 9.0.
>
> We also tried playing with some of the server tuning parameters, such
> as shared_buffers, to no avail.
>
> Here is uname -a for a machine we have tested on:
>
> Linux nevs-bdb1.fsl.noaa.gov 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20
> 07:12:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> A sample dataset that reproduces these results looks like the following
> (there are no indexes):
>
> Table "bulk_performance.counts"
> Column | Type | Modifiers
> --------+---------+-----------
> i1 | integer |
> i2 | integer |
> i3 | integer |
> i4 | integer |
>
> There are 22 million rows in this case.
>
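For anyone wanting to reproduce this, a table of the same shape can be
generated with something like the following (the values themselves are
placeholders; only the column types and row count match the description
above):

  CREATE TABLE bulk_performance.counts (
      i1 integer,
      i2 integer,
      i3 integer,
      i4 integer
  );

  -- 22 million rows of placeholder data
  INSERT INTO bulk_performance.counts
  SELECT g, g, g, g
  FROM generate_series(1, 22000000) AS g;
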
> We HAVE observed that summation queries run considerably faster. In this
> case,
>
> select sum(i1), sum(i2), sum(i3), sum(i4) from bulk_performance.counts
>
> runs at 35 MB/s.
>
>
> Our business logic does operations on the resulting data such that the
> output is several orders of magnitude smaller than the input. So we had
> hoped that by putting our business logic into stored procedures (and
> thus drastically reducing the amount of data flowing to the client) our
> throughput would go way up. This did not happen.
>
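If the stored-procedure version still streams every row out through the
normal result path, the per-row conversion cost does not go away. A
minimal sketch of the shape that keeps the reduction entirely server
side (8.4+ syntax; the function name and the aggregation here are
placeholders for the real business logic):

  CREATE OR REPLACE FUNCTION bulk_performance.reduce_counts()
  RETURNS TABLE (s1 bigint, s2 bigint, s3 bigint, s4 bigint)
  LANGUAGE sql AS $$
      -- placeholder reduction; substitute the real business logic
      SELECT sum(i1), sum(i2), sum(i3), sum(i4)
      FROM bulk_performance.counts;
  $$;

  SELECT * FROM bulk_performance.reduce_counts();
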
> So our questions are as follows:
>
> Is there any way, using stored procedures (maybe C code that calls SPI
> directly) or some other approach, to get close to the expected 35 MB/s
> for these bulk reads? Or is this the price we have to pay for using SQL
> instead of some NoSQL solution? (We actually tried Tokyo Cabinet and
> found it to perform quite well; however, it does not measure up to
> Postgres in terms of replication, data interrogation, community
> support, acceptance, etc.)
>
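Before reaching for SPI, it may be worth timing COPY in binary format:
it streams the whole table in one round trip and skips most of the
per-row text conversion that a plain SELECT pays for. From psql,
something like the following (the output path is a placeholder, and the
server-side form needs superuser in these versions):

  -- server-side: the file is written by the backend process
  COPY bulk_performance.counts TO '/tmp/counts.bin' WITH BINARY;

  -- client-side equivalent via psql, same wire format
  \copy bulk_performance.counts to '/tmp/counts.bin' with binary

libpq exposes the same stream programmatically through PQgetCopyData(),
and the JDBC driver through org.postgresql.copy.CopyManager, so the
client can parse the binary tuples at close to sequential-scan speed.
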
> Thanks
>
> Dan Schaffer
> Paul Hamer
> Nick Matheson

Whoa... Deja Vu

Is this the same thing Nick is working on? How'd he get along?

http://archives.postgresql.org/message-id/4CD1853F.2010800@noaa.gov

-Andy
