From: Richard Huxton <dev(at)archonet(dot)com>
To: Mark Jensen <musicnyman1974(at)yahoo(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Including unique users in huge data warehouse in Postgresql...
Date: 2006-11-29 19:29:35
Message-ID: 456DDF9F.30207@archonet.com
Lists: pgsql-general
Mark Jensen wrote:
> So i've been given the task of designing a data warehouse in
> either Postgresql or Mysql for our clickstream data for our sites. I
> started with Mysql but the joins in Mysql are just way too slow
> compared to Postgresql when playing with star schemas.
Mark - it's not my usual area, but no-one else has picked up your
posting, so I'll poke my nose in. One thing you might want to do is
post this on the performance list - that's probably the best place for it.
It might also be worth talking to the people at www.bizgres.org (although
I think they all hang out on the performance list too).
> I can't say
> which sites i'm working on, but we get close to 3-5 million unique
> users per day, so over time, that's a lot of unique users to keep
> around and de-dup your fact tables by. Need to be able to query normal
> analytics like:
<snip>
> i've
> made a lot of optimizations in postgresql.conf by playing with work_mem
> and shared_buffers and such and i think the database is using as much
> as it can disk/memory/cpu wise.
Big work_mem, I'm guessing. Limiting factor is presumably disk I/O.
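If it helps, work_mem can also be raised just for the session running the
big reports rather than globally - a rough sketch, and the figure below is
purely illustrative since I don't know how much RAM you can spare per backend:

  -- check the current setting, then raise it for this session only
  SHOW work_mem;
  SET work_mem = 262144;  -- value is in kB, i.e. 256MB here; illustrative only
  -- then re-run the report under EXPLAIN ANALYZE and see whether the
  -- sort/hash steps still spill to disk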
<snip>
> here's a sample query that takes a while to run... just a simple report that shows gender by area of the site.
>
> select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) as clicks, count(distinct(C.uu_id)) as users
> from uus as A, areas as B, daily_area_fact as C
> where A.uu_id = C.uu_id
> and B.area_id = C.area_id
> group by gender,area;
>
> so with just one day of data, 3,168,049 rows in the user
> dimension table (uus) and 17,213,420 in the daily_area_fact table that
> joins all the dimension tables, this takes about 15 minutes. if i had 30-90
> days in this fact table, who knows how long it would take... i know
> doing a distinct on uu_id is very expensive, so that's the main problem
> here i guess, and i'd like to know if anyone else is doing it this way
> or has a better approach.
In the end, I'd suspect the seq-scan over the fact table will be your
biggest problem. Can you pre-aggregate your fact table into daily summaries?
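Something along these lines, perhaps - just a sketch, and I'm guessing at
names: daily_area_summary is made up, and log_date stands in for whatever
column you key the day by:

  -- collapse the fact table to one row per user/area/day; imps and clicks
  -- roll up, and uu_id is kept so the distinct count still works
  -- (daily_area_summary and log_date are placeholder names)
  CREATE TABLE daily_area_summary AS
  SELECT log_date, uu_id, area_id,
         sum(imps) AS imps,
         sum(clicks) AS clicks
  FROM daily_area_fact
  GROUP BY log_date, uu_id, area_id;

  -- your report then runs against the much smaller summary table
  SELECT A.gender AS gender, B.area AS area,
         sum(C.imps) AS imps, sum(C.clicks) AS clicks,
         count(distinct C.uu_id) AS users
  FROM uus AS A, areas AS B, daily_area_summary AS C
  WHERE A.uu_id = C.uu_id
    AND B.area_id = C.area_id
  GROUP BY gender, area;

With 30-90 days loaded you'd probably want an index (or partitions) on that
date column too, but see what the planner makes of the summary first.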
See you over on the performance list, where there are more experienced
people than myself to help you.
--
Richard Huxton
Archonet Ltd