Re: *very* slow query to summarize data for a month ...

From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: *very* slow query to summarize data for a month ...
Date: 2003-11-12 05:52:44
Message-ID: 87oevii15f.fsf@stark.dyndns.tv
Lists: pgsql-performance


"Marc G. Fournier" <scrappy(at)postgresql(dot)org> writes:

> Just as a side note, just doing a straight scan for the records, with no
> SUM()/GROUP BY involved, with the month_trunc() index is still >8k msec:

Well then the problem isn't the query at all; you simply have too much data to
massage online. You can preprocess the data offline into a more manageable
amount of data for your online reports.

What I used to do for a similar situation was to do hourly queries sort of
like this:

insert into data_aggregate (day, hour, company_id, total_bytes)
(select date_trunc('day', now()), date_trunc('hour', now()), company_id, sum(bytes)
   from raw_data
  where time >= date_trunc('hour', now())
    and time < date_trunc('hour', now()) + interval '1 hour'
  group by company_id
)

[this was actually on oracle and the data looked kind of different, i'm making
this up as i go along]

Then later the reports could run quickly based on data_aggregate instead of
slowly based on the much larger data set accumulated by the minute. Once I had
this schema set up it was easy to follow it for all of the rapidly growing
data tables.
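To make the payoff concrete, here is a small self-contained sketch of the same idea using Python's built-in sqlite3 (not the original Oracle/PostgreSQL setup). The table and column names (raw_data, data_aggregate, company_id, bytes) follow the message above; the sample rows and timestamps are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_data (time TEXT, company_id INTEGER, bytes INTEGER);
    CREATE TABLE data_aggregate (day TEXT, hour TEXT, company_id INTEGER,
                                 total_bytes INTEGER);
""")

# A few raw per-minute rows, all within one hour, for two companies.
conn.executemany("INSERT INTO raw_data VALUES (?,?,?)", [
    ("2003-11-12 05:01:00", 1, 100),
    ("2003-11-12 05:02:00", 1, 250),
    ("2003-11-12 05:03:00", 2, 400),
])

# The hourly batch job: collapse the hour's raw rows into one row per company.
conn.execute("""
    INSERT INTO data_aggregate (day, hour, company_id, total_bytes)
    SELECT date(time), strftime('%Y-%m-%d %H:00', time), company_id, SUM(bytes)
      FROM raw_data
     WHERE time >= '2003-11-12 05:00:00' AND time < '2003-11-12 06:00:00'
     GROUP BY company_id
""")

# Reports now scan the (much smaller) aggregate table instead of raw_data.
report = conn.execute("""
    SELECT company_id, SUM(total_bytes)
      FROM data_aggregate
     GROUP BY company_id
     ORDER BY company_id
""").fetchall()
print(report)  # [(1, 350), (2, 400)]
```

The report query touches one row per company per hour rather than one row per raw event, which is where the speedup comes from.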

Now in my situation I had thousands of records accumulating per second, so
hourly was already a big win. I originally chose hourly because I thought I
might want time-of-day reports, but that never panned out. On the other hand,
it was a win when the system broke once: I could see the problem and fix it
before midnight, when it would actually have mattered. Perhaps in your
situation you would want daily aggregates or something else.

One of the other advantages of these aggregate tables was that we could purge
the old data much sooner with much less resistance from the business, since
the reports were all still available and a lot of ad-hoc queries could still
be done without the raw data anyway.

Alternatively you can just give up on online reports. Eventually you'll have
some query that takes way more than 8s anyway. You can pregenerate the entire
report as a batch job instead. Either send it off as a nightly e-mail, store
it as an html or csv file for the web server, or (my favourite) store the data
for the report as an sql table and then have multiple front-ends that do a
simple "select *" to pull the data and format it.
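A minimal sketch of that last option, again using sqlite3 for portability rather than the original database: a batch job materializes the finished report into its own table, and every front-end only ever runs a trivial "select *". The monthly_report name and the sample data are assumptions, not from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_aggregate (day TEXT, company_id INTEGER, total_bytes INTEGER);
    INSERT INTO data_aggregate VALUES
        ('2003-11-01', 1, 500),
        ('2003-11-02', 1, 300),
        ('2003-11-01', 2, 700);
""")

# Nightly batch job: rebuild the report table from the aggregates.
conn.executescript("""
    DROP TABLE IF EXISTS monthly_report;
    CREATE TABLE monthly_report AS
    SELECT company_id, SUM(total_bytes) AS month_bytes
      FROM data_aggregate
     WHERE day >= '2003-11-01' AND day < '2003-12-01'
     GROUP BY company_id
     ORDER BY company_id;
""")

# Any front-end, no matter how dumb, can now render the report.
rows = conn.execute("SELECT * FROM monthly_report").fetchall()
print(rows)  # [(1, 800), (2, 700)]
```

The expensive GROUP BY runs once per night instead of once per page view, and the front-ends need no knowledge of the underlying schema.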

--
greg
