Re: pgsql and large tables

From: Francisco Reyes <lists(at)natserv(dot)com>
To: Gurupartap Davis <partap(at)yahoo(dot)com>
Cc: pgsql General List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pgsql and large tables
Date: 2001-11-15 23:33:17
Message-ID: 20011115180851.X60926-100000@zoraida.natserv.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, 15 Nov 2001, Gurupartap Davis wrote:

> Thanks for trying to help me out...I've got some more info for you:

Not a problem, but remember to include the list. I am but a relatively new
users and you would deprive yourself from the help of others more
experienced if you don't cc the list.

> I'm migrating this table from an existing mysql installation...
> This is what it looks like right now:
> Table "forecast"
> Attribute | Type | Modifier
> --------------+-----------------------+----------
> zhr | smallint |
> zday | smallint |
> model | character varying(32) |
> temp | numeric(6,2) |
> modelhr | smallint | not null
> modelruntime | smallint | not null
> modelrundate | smallint |
> stn | character(4) | not null
> rh | numeric(6,2) |
> wdsp | numeric(6,2) |
> wddir | character varying(2) |
> dwpt | numeric(6,2) |
> lpre | numeric(6,2) |
> yearmoda | date | not null
>
> It's a table for weather forecasts, a record is identified uniquely by
> (model, stn, yearmoda, modelruntime, modelhr) although I will rarely have a
> query that specifies all of those fields.

What are your most heavily looked upon columns and do you have indexes on
them?

>The reason for the permuted indexes mentioned above

Are those 5 columns what make a record unique?
I see it strange that all of the fields that you described as your primary
key you didn't put not null on them. In particular model. What are zhr and
zday?

>is because mysql will use a prefix of a multi-column
>key to narrow down a search. I guess pgsql doesn't use indexes that way?

Don't know. I usually use Index according to my needs. If I need to search
a lot on 3 columns then I index those 3 columns. I don't see the point on
adding columns to an index if it will rarely be used.

>(I noticed in the to-do list something like "reenable partial indexes")

So far I have not tried to do searches on partial index keys.

> The cardinality on these by the way, is approximately: model: 15-30,
> stn:1800, yearmoda: ~450 so far..., modelruntime: 4, and
> modelhr:10-40...Does this make any difference on what order they should be
> listed in the index?

I think that on a compound index then probably not. If you have something
which is a limiting factor then you may want to have that by itself on an
index. In general, not only on pgsql, anything which would limit your
searches is usually to have that on it's own index. I don't know much
about your data, but as an example if you work a lot by range of dates
then having an index on date would be a good index to have.

> So, I should add a serial id column, I guess, and make that the primary key
That is one approach and I still don't understand enough your data to say
this is actually good.

As a theoretical example let's say that you have a an index like you
suggest below model, stn, yearmoda and modelruntime, these could be on a
model table and then have a serial key on those 4. On the other table
with the rest of the info you only have the key instead of those 4 fields.
This is basically normalization of your data. It's advantage is limiting
how much I/O you need to do. Are you familiar with normalization? Don't
want to bore you with stuff you may know.

>Now I need some useful indexes. Most of my queries will be for a
>complete model run at a particular station, so I'm thinking of an index on
>(model, stn, yearmoda, modelruntime) to return all modelhr's at a stn ...

When you do your DB design you not only want to look at what makes it easy
to use, but what makes it efficient. This is where normalization usually
helps. Although I rarelly ever normalize formally, I usually have some of
it's concepts used depeding on what I want to achieve.

>Another common set of queries deals with a *complete* model run (all stn's),
>so I would need another index on (model, yearmoda, modelruntime).
> Yet another useful query type contains all model runs within a certain date
> range, aggregated by modelhr...it looks to me like the above 2 indexes might
> be sufficient for that, though....

I advise you try to think if there is anything which can limit your
queries. Any factor in common on all these queries. If such limiting
factor exists then you could have this on a separate, smaller, table.

> ...or would I need an index that references modelhr for that one???

It is difficult for me to try to understand your data, since I don't know
almost anything about the meaning of the fields and how you will access
these columns.

> Sometimes I will need to re-import a forecast because something went wrong,
> and all the values are bogus...can I do a REPLACE into a table, specifying

Postgresql Doesn't have a REPLACE, but what you do is that you "begin" a
transaction, delete all the old data, re-insert the model and then "end"
the transaction.

> How exactly is having a primary key as serial going to help me here?

If you can have some sort of "parent" table with some data which is a
"key" and that you will use it as your starting point. Having a smaller
table with the right indexes can greatly help your queries due to smaller
I/O needing to be done.

> The 40 million rows, by the way, is for about 2 months of data ;-)...we've
> got about another year of data to import (and will be growing by about 1.5
> million rows a day) , so I want to make sure I've got the table and indexes
> set up optimally, first, since it will take about a week to import all the
> data.

When you talk about such sizes you need to look at your hardware and at
the optimizations you have done with PostgreSQL, or any database for that
matter.

> It is currently on a P3 700MHz machine, running RedHat 7.2 and PostgreSQL
> 7.1.3, with 256MB of RAM. We have a dual P3 700 with 512MB that we could
> move it to for production(the mysql db is currently running on it)

Memory is super cheap nowadays, about $30 for 256MB if memory serves me
right. Increase the memory on these machines. Are you using IDE or SCSI?
How many HDs? Using any type of RAID? Have you increased your buffers on
PostgreSQL? Have you looked at your shared memory settings? Are you using
explain with all your queries to see if they are using your indexes?

You also mentioned issues with downtime. Is this DB going to be used
24x7? You need to do vacuum analyze at least after every big update.

How often will your data be updated? Once data is loaded will it be
changed at all?

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Risko Peter 2001-11-15 23:40:06 Re: index on numbers not honoured
Previous Message Doug McNaught 2001-11-15 23:17:28 Re: Bug?