From: Hrishikesh (हृषीकेश मेहेंदळे) <hashinclude(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance issues with large amounts of time-series data
Date: 2009-08-26 18:39:40
Message-ID: a7c00d4b0908261139l7a12990bx7ee4f6324723e02b@mail.gmail.com
Lists: pgsql-performance
Hi Tom,
Thanks for your quick response.
2009/8/26 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> <hashinclude(at)gmail(dot)com> writes:
> > In my timing tests, the performance of PG is quite a lot worse than the
> > equivalent BerkeleyDB implementation.
>
> Are you actually comparing apples to apples? I don't recall that BDB
> has any built-in aggregation functionality. It looks to me like you've
> moved some work out of the client into the database.
I'm measuring end-to-end time, which includes the in-code aggregation
with BDB (done after the DB fetch) and the in-query aggregation in PG.
> > 1. Is there anything I can do to speed up performance for the queries?
>
> Do the data columns have to be bigint, or would int be enough to hold
> the expected range? SUM(bigint) is a *lot* slower than SUM(int),
> because the former has to use "numeric" arithmetic whereas the latter
> can sum in bigint. If you want to keep the data on-disk as bigint,
> but you know the particular values being summed here are not that
> big, you could cast in the query (SUM(data_1::int) etc).
For the 300-sec tables I can probably drop it to an integer, but the
3600- and 86400-sec tables (1 hr, 1 day) will probably need to stay
BIGINTs. However, given that I'm on a 64-bit platform (sorry if I
didn't mention it earlier), does it make that much of a difference?
How does a float ("REAL") compare in terms of SUM()s?
> I'm also wondering if you've done something to force indexscans to be
> used. If I'm interpreting things correctly, some of these scans are
> traversing all/most of a partition and would be better off as seqscans.
One thing I noticed is that if I specify which devices I want the data
for (specifically, all of them, listed out as DEVICE IN (1,2,3,4,5...)
in the WHERE clause), PG uses a Bitmap Heap Scan, while if I don't
specify the list (which still gives me data for all the devices), PG
uses a sequential scan. (I might have missed the DEVICE IN (...) in my
earlier query.) However, more often than not, the query _will_ be of
the form DEVICE IN (...). If I actually execute the queries (on the
psql command line), their runtimes are about the same (15s vs 16s).
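For concreteness, the two variants look roughly like this (table,
column, and device values are placeholders):

-- with the explicit device list, the planner picks a Bitmap Heap Scan:
EXPLAIN ANALYZE
SELECT SUM(data_1)
FROM stats_300
WHERE device IN (1, 2, 3, 4, 5);

-- without the list (same rows returned, since that covers every
-- device), it picks a sequential scan:
EXPLAIN ANALYZE
SELECT SUM(data_1)
FROM stats_300;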
> > shared_buffers = 128MB
>
> This is really quite lame for the size of machine and database you've
> got. Consider knocking it up to 1GB or so.
OK, I've bumped it up to 1 GB. However, that doesn't seem to make a
huge difference (unless I need to do the same on libpqxx's connection
object too).
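As a sanity check after the restart (shared_buffers is a server-wide
setting in postgresql.conf, so I assume nothing needs to change on the
client/libpqxx side), I can verify the server picked it up with:

SHOW shared_buffers;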
Cheers,
Hrishi