Re: PG for DataWarehouse type Queries

From: Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: PG for DataWarehouse type Queries
Date: 2007-08-06 01:24:01
Message-ID: 1186363441.5418.9.camel@neuromancer.home.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, 2007-08-03 at 07:55 -0600, Josh Tolley wrote:
> On 8/3/07, Ow Mun Heng <Ow(dot)Mun(dot)Heng(at)wdc(dot)com> wrote:
> > Can anyone shed some light on this. I just would like to know if
> queries
> > for raw data (not aggregregates) is expected to take a long time.
> > Running times between 30 - 2 hours for large dataset pulls.
> >
> > Involves lots of joins on very large tables (min 1 millon rows each
> > table, 300 columns per table)
> >
> > Joins are done in the form of Left joins (sometimes on the same tables,
> > due to normalisation)
> >
> > Is 30min - 2hours too long or is this considered "normal"??
>
> That depends entirely on your query, your dataset, the machine you're
> using, etc. Your best bet is probably to post an EXPLAIN ANALYZE of
> your query along with some description of what's going on and what
> you're running this all on (pg version, machine specs, etc.) and see
> if someone can come up with an acceptable way to make the query
> faster.

Sorry this generated more questions. I was vague cos it wasn't really
running on PG (yet)

Actual query is running on SQL Server. I'm considering offloading some
of those queries to a PG (i have 8.2.4 running for testing purposes)
server. (Very Low end desktop whitebox spec, eg: 1G Ram, 3x500GB drives
raid 1/1+0/0+1, PIV 2G, with the tables Pre-joined and de-normalised
every say 5 min and perform table partitioning based on date ranges)

Current queries in SQL server takes between 30min to 2 hours. (4G
mem/Quad Dual code Xeon)

Queries involves 11 Joins (of 5 Main tables, of which 3 are Large, 2 are
fact tables)

large tables has > 1min rows, fact ~max 200 rows)

Query is of type

Select
A.A,
A.B,
C.A,
D.D
from
(Select
...
)A
left outer join
(select ...
...
)B
on A.A = B.A
let outer join
(select
...
)C

and so on..

I'm hoping that by pre-joining them and using table partitioning, the
performance would greatly increase. (I'm still testing)

BTW, this question is just to have a feel if it's "normal" to have such
long running queries (time) for ETL type queries.

Sorry for the noise (since this is not really PG related yet)
I'm still researching the feasibility etc.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message hanasaki 2007-08-06 01:29:53 Re: HA, failover and load balancing / howto?
Previous Message 李彦 Ian Li 2007-08-06 01:15:31 Re: Postgres 8.2 binary for ubuntu 6.10?