Re: pgsql and large tables

From: "Gurupartap Davis" <partap(at)yahoo(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Re: pgsql and large tables
Date: 2001-11-15 22:57:04
Message-ID: 014001c16e28$d8451dc0$0f00a8c0@marlows
Lists: pgsql-general

Err... forgot to cc the list...

--------------------------------------
Thanks for trying to help me out...I've got some more info for you:

> > I've got a large database, currently about 40 million rows in the
> > biggest table.
> ...
>
> > I initially created the table with a primary key (5 fields: char(4),
> > varchar(32), date, int, int)
>
> Out of curiosity, why are you using this as your primary key and not a
> serial? Will your tables be normalized?
>
> > and a couple additional indexes, one that just changes the order of the
> > primary key, and one that sorts on the date.
>
> From what I have read so far, your primary key doesn't sound like what
> you would want to have as a primary key... but I guess the design of your
> table is a totally different issue.

I'm migrating this table from an existing mysql installation...
This is what it looks like right now:
          Table "forecast"
  Attribute   |         Type          | Modifier
--------------+-----------------------+----------
 zhr          | smallint              |
 zday         | smallint              |
 model        | character varying(32) |
 temp         | numeric(6,2)          |
 modelhr      | smallint              | not null
 modelruntime | smallint              | not null
 modelrundate | smallint              |
 stn          | character(4)          | not null
 rh           | numeric(6,2)          |
 wdsp         | numeric(6,2)          |
 wddir        | character varying(2)  |
 dwpt         | numeric(6,2)          |
 lpre         | numeric(6,2)          |
 yearmoda     | date                  | not null

It's a table for weather forecasts; a record is identified uniquely by
(model, stn, yearmoda, modelruntime, modelhr), although I will rarely have a
query that specifies all of those fields. The reason for the permuted
indexes mentioned above is that mysql will use a prefix of a multi-column
key to narrow down a search. I guess pgsql doesn't use indexes that way?
(I noticed in the to-do list something like "reenable partial indexes".)
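
To make that prefix behaviour concrete with the columns above (this is just
an illustration of the mysql-style usage I mean; the index name is made up,
and whether pgsql's planner behaves the same way is exactly my question):

```sql
-- A multi-column index, declared something like this:
CREATE INDEX forecast_msym_idx
    ON forecast (model, stn, yearmoda, modelruntime);

-- A query constraining a leading prefix of the index columns
-- can use the index in mysql:
SELECT * FROM forecast WHERE model = 'avn' AND stn = 'KDRO';

-- A query that skips the leading column cannot, hence the extra
-- "permuted" index with the columns reordered:
SELECT * FROM forecast WHERE stn = 'KDRO';
```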

The cardinality on these, by the way, is approximately: model: 15-30,
stn: 1800, yearmoda: ~450 so far..., modelruntime: 4, and
modelhr: 10-40... Does this make any difference to what order they should be
listed in the index?

So, I should add a serial id column, I guess, and make that the primary key
(why is this? I'm wondering, since I will most likely never refer to that
column). Now I need some useful indexes. Most of my queries will be for a
complete model run at a particular station, so I'm thinking of an index on
(model, stn, yearmoda, modelruntime) to return all modelhr's at a stn ...
Another common set of queries deals with a *complete* model run (all stn's),
so I would need another index on (model, yearmoda, modelruntime).
Yet another useful query type contains all model runs within a certain date
range, aggregated by modelhr...it looks to me like the above 2 indexes might
be sufficient for that, though....
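
In SQL, the two indexes I have in mind would look something like this (the
index names are made up; I'm just sketching what I described above):

```sql
-- For a complete model run at one station:
CREATE INDEX forecast_run_at_stn_idx
    ON forecast (model, stn, yearmoda, modelruntime);

-- For a complete model run across all stations:
CREATE INDEX forecast_full_run_idx
    ON forecast (model, yearmoda, modelruntime);
```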

e.g.:

SELECT avg(temp) FROM forecast
 WHERE model = 'avn' AND stn = 'KDRO'
   AND yearmoda >= $date1 AND yearmoda <= $date2
   AND modelruntime = 0
 GROUP BY modelhr;

...or would I need an index that references modelhr for that one???
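
I suppose I can just check with EXPLAIN once some data is loaded; something
like the following (dates made up) should show whether the planner picks an
index scan or falls back to a sequential scan:

```sql
EXPLAIN
SELECT avg(temp)
  FROM forecast
 WHERE model = 'avn' AND stn = 'KDRO'
   AND yearmoda >= '2001-09-15' AND yearmoda <= '2001-11-15'
   AND modelruntime = 0
 GROUP BY modelhr;
```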

Oh, and I guess I still need a UNIQUE index, as well. Hmm. Now I'm
wondering again if that serial id column is going to mess with me.
Sometimes I will need to re-import a forecast because something went wrong,
and all the values are bogus...can I do a REPLACE into a table, specifying
all of the columns of a UNIQUE index, without specifying the primary key?
How exactly is having a primary key as serial going to help me here? (Sorry
if this is a dumb question, I'm kinda new to this)
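
My current guess for the re-import, since there's no REPLACE in pgsql, is a
delete-then-insert inside a transaction, keyed on the UNIQUE columns and
never mentioning the serial id at all (all values made up; corrections
welcome):

```sql
BEGIN;
DELETE FROM forecast
 WHERE model = 'avn' AND stn = 'KDRO'
   AND yearmoda = '2001-11-15' AND modelruntime = 0;
-- the serial id column gets a fresh value automatically on insert
INSERT INTO forecast (model, stn, yearmoda, modelruntime, modelhr, temp)
VALUES ('avn', 'KDRO', '2001-11-15', 0, 12, 4.50);
COMMIT;
```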

> >Should I make an artificial primary key with a serial type to simplify
things?
>
> I recommend you do this. Not only because it is more efficient, but
> because the type of key you selected has "issues" for lack of a better
> term. You can of course have a "unique" index so you don't have dups.
>
> I think that when you are dealing with a 40-million-row table you need to
> consider your hardware. You also didn't tell us how big are the rows.
> The more info you give us the more others will be able to help.

The 40 million rows, by the way, is for about 2 months of data ;-)...we've
got about another year of data to import (and will be growing by about 1.5
million rows a day), so I want to make sure I've got the table and indexes
set up optimally, first, since it will take about a week to import all the
data.

It is currently on a P3 700MHz machine, running RedHat 7.2 and PostgreSQL
7.1.3, with 256MB of RAM. We have a dual P3 700 with 512MB that we could
move it to for production (the mysql db is currently running on it).
