From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Ow Mun Heng" <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Q: Aggregating Weekly Production Data. How do you do it?
Date: 2007-09-19 15:06:36
Message-ID: dcc563d10709190806w19c185c4s3ef57b0248fe0fb1@mail.gmail.com
Lists: pgsql-general
On 9/18/07, Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> wrote:
> On Tue, 2007-09-18 at 09:55 -0500, Scott Marlowe wrote:
> > On 9/17/07, Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> wrote:
> >
> > > Just wondering how everyone is doing aggregation of production data.
> >
> > Where I work, we aggregate by the minute in the application, then dump
> > those stats entries into the database en masse from each machine. So,
> [snip]
> > The fail1/2/3 are types of failures, totalresptime is the time of
> > all the successful requests added together, so that
> > totalresptime/success = average for that minute, and max is the
> > longest running request ending in that minute.
>
> So, it's basically a summation rather than using avg() all the way.
> That would mean using something like bigint to deal with the large
> numbers.
Yeah. We use numeric(18) (bonus points if anyone knows what other
database I'm pumping data out of to get a numeric(18) as a default).
The advantage is that I don't have to re-weight my measurements. When
the system was first designed, it stored the average response and the
number of responses, which I then had to multiply back together to get
the original total time before computing a weighted average. This
method removes the averaging step in the client software AND the
multiplying step to get the raw numbers back on my end, win-win.
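To illustrate what I mean, here's a rough sketch (the table and column
names are placeholders, not our real schema):

CREATE TABLE minute_stats (
    stamp         timestamp   NOT NULL, -- minute the samples belong to
    host          text        NOT NULL, -- machine that reported the row
    success       integer     NOT NULL, -- successful requests that minute
    fail1         integer     NOT NULL, -- failure counts by type
    fail2         integer     NOT NULL,
    fail3         integer     NOT NULL,
    totalresptime numeric(18) NOT NULL, -- sum of response times, not the avg
    maxresptime   numeric(18) NOT NULL  -- longest request ending that minute
);

-- Because we store sums, a correctly weighted average over any window
-- falls straight out of the raw numbers:
SELECT sum(totalresptime) / nullif(sum(success), 0) AS avg_resp
FROM minute_stats
WHERE stamp >= now() - interval '1 hour';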
> > > (getting min/max/count isn't much of an issue. Stdev is the main issue I
> > > believe)
> > >
> > > One such instance I've read about is..
> >
> > Isn't stddev() working for you?
>
> Stddev is based on a population of data. If I do slices and I want to
> take the stddev across 2 months (8 weeks, e.g.) then it would be wrong.
But if you ask for the data set grouped by
date_trunc('minute',timestmp) in an inner select, then in the outer
select you can do a stddev on that set and get it. Or is that not
accurate?
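Roughly like this (untested, and the names follow the placeholder
schema above):

SELECT stddev(minute_avg) AS resp_stddev
FROM (
    SELECT date_trunc('minute', stamp) AS minute,
           sum(totalresptime) / nullif(sum(success), 0) AS minute_avg
    FROM minute_stats
    WHERE stamp >= now() - interval '8 weeks'
    GROUP BY 1
) AS per_minute;

The inner select gives you one averaged value per minute slice, and the
outer stddev() is then taken over that whole population of slices.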
> > What I do is aggregate the minute time slices by grouping by
> > date_trunc('xxx',timestamp) and then use that as a subselect to a
> > query that does the stddev() outside of that. works pretty well for
> > us, and makes it easy to identify trends.
>
> Trending analysis is very critical and the only reference I've found on
> how to get "stdev" is based on what I posted.
Well, what we're doing seems to be getting us what look like proper numbers.
> > One of the things we started doing is to aggregate the last six weeks
> > data by the day / hour and then comparing the last 24 hours worth of
> > data to those six weeks worth to see trends for each hour of each day.
> > The queries are huge and ugly, but they work, and run in about 4
> > minutes on a fairly good sized chunk of data. We have about 150k to
> > 300k entries a day put into this db.
>
> I'm not sure how many K entries in a day (yet to count it) but I'm
> getting into trouble with one-to-many relationships and PG is choosing
> to do nested loops etc. (lots of left joins on the same tables)
Yeah, for this kind of thing, you want to pre-process your tables into
one big flat table if you can. We do have one-to-many relationships
within the same table (parent / child system requests are one to many),
and when we join the table to itself several times it can get hairy.
We had some nested loop problems until I upped the stats target on a
few of the columns and scheduled regular ANALYZE runs for the main
table. Now the system picks the right plan 99.9% or more of the time,
and when it doesn't, it's usually something fresh and new that's the
issue.
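For reference, the knobs I mean are the per-column statistics targets
(table and column names below are made up for the example):

ALTER TABLE request_stats ALTER COLUMN parent_id SET STATISTICS 100;
ALTER TABLE request_stats ALTER COLUMN stamp     SET STATISTICS 100;
ANALYZE request_stats;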
> So, I've had to resort to using SRFs and function scans. A >2 hour
> complex query (on my laptop's PG; it runs in 20 min on high-end MSSQL)
> is reduced to ~60 secs.
That's an excellent improvement!
> Thanks for the response.
>
> BTW, are user queries an issue? Do they interfere with the loading?
Not usually. I've got them trained pretty well, and the system has
lots more I/O bandwidth than CPU horsepower, so generally a long
running query just waits on I/O. RAID-10 has made this system much
faster than the old single-drive workstation (mine) it was running on.
The data loading on this system runs once a minute; it pumps the data
out of the production db and puts it into the reporting db. Average
run time to insert those rows is < 1 second.
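The load itself is nothing fancy, just a batch of per-minute rows, so
each run is on the order of (values and names invented for
illustration):

INSERT INTO minute_stats
    (stamp, host, success, fail1, fail2, fail3, totalresptime, maxresptime)
VALUES
    ('2007-09-19 15:05:00', 'app01', 812, 3, 0, 1, 901234, 4321),
    ('2007-09-19 15:05:00', 'app02', 790, 1, 2, 0, 887654, 3987);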
> (That's one of my concerns over here). I noticed that you do the calc
> at the app before sending it to the DB, which is a good thing, and
> every minute too. (Is it advisable to do it in chunks of 1 min? Less
> data per minute vs. per 30 min / 1 hour.)
Exactly. Each minute is easy to keep track of, and if the app goes
crazy for a bit and loses the data / gets restarted, you only lose a
minute or so worth of data.
Also, we use this system for real-time alerting and performance
monitoring. If some request or other starts taking too long or the
failure rate shoots up, it generates alerts. If the data were only
collected every 30 minutes or an hour, that would be far too late.
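The alert checks are just more queries against the same per-minute
rows; something along these lines (thresholds and names are invented
for illustration):

SELECT stamp,
       sum(fail1 + fail2 + fail3)::float
           / nullif(sum(success + fail1 + fail2 + fail3), 0) AS failure_rate
FROM minute_stats
WHERE stamp >= now() - interval '5 minutes'
GROUP BY stamp
HAVING sum(fail1 + fail2 + fail3)::float
           / nullif(sum(success + fail1 + fail2 + fail3), 0) > 0.05;

Any rows that come back represent minutes where more than 5% of
requests failed, which is what trips the alert.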