very large tables

From: Ramiro Diaz Trepat <ramiro(at)diaztrepat(dot)name>
To: pgsql-general(at)postgresql(dot)org
Subject: very large tables
Date: 2009-05-26 07:28:54
Message-ID: 313dbe090905260028w2f00ae54od267ca6596395751@mail.gmail.com
Lists: pgsql-general

Thank you all very much for your thorough replies. To be honest, I had not
heard of many of the things you mention before.
It seems that clustering by an index is the way to go. I now have to check
whether most queries will benefit from it.
I do not directly control the installation parameters of the database; I do
not have root access to that server, but I may be able to find another one
where I can control these parameters.
The configuration parameters I should look to increase are the ones related
to shared memory, then?
I know it is something of a worst-case combination to have such a large
database with an inexperienced DBA like myself :)
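If I have understood the clustering advice correctly, it would be something
along these lines (untested on my side; I am clustering on the moment index
here, but it could just as well be a longitude/latitude index if that is what
most queries filter on):

CLUSTER atm_moment_idx ON atmospherepoint2;
ANALYZE atmospherepoint2;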

r.

On Mon, May 25, 2009 at 12:17 PM, Alban Hertroys <
dalroi(at)solfertje(dot)student(dot)utwente(dot)nl> wrote:

> On May 25, 2009, at 10:58 AM, Ramiro Diaz Trepat wrote:
>
> The table with the atmosphere pixels currently has about 140MM records,
>> and the one with the values about 1000MM records. They should both grow to
>> about twice this size.
>>
>>
>
> Did you tune postgres to use the available resources? By default it comes
> with rather modest settings so it will run on low-spec hardware without
> trouble. For a dataset this size modest hardware is obviously out of the
> question, and so the default config will likely not be suitable.
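>
> For example, as a very rough starting point (the numbers are purely
> illustrative and need to be scaled to the memory actually available on that
> machine), in postgresql.conf:
>
> shared_buffers = 2GB            # default is only a few tens of MB
> effective_cache_size = 12GB     # planner hint: memory available for caching
> work_mem = 64MB                 # per sort/hash operation, so mind concurrency
> maintenance_work_mem = 1GB      # speeds up index builds and VACUUM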
>
>
>> A simple query with no joins like this:
>>
>> select trajectory from atmospherepoint where moment='1979-05-02 11:45:00'
>> and height >= 12000 and height <= 14000 and longitude >= 180 and longitude
>> <= 190 and latitude >= 0 and latitude <= 10;
>>
>
> I think the BETWEEN operator would make that query a bit easier on the
> eyes, like so:
>
> select trajectory from atmospherepoint where moment='1979-05-02 11:45:00'
> and height between 12000 and 14000 and longitude between 180 and 190 and
> latitude between 0 and 10;
>
> is taking almost 4 minutes on a decent multi-core server. Also, the
>> moment equality test should yield no more than 2MM rows, so I thought that
>> should make things a lot faster.
>>
>> The explain returns the suspicious "Seq Scan" up front:
>>
>> Seq Scan on atmospherepoint (cost=0.00..5461788.08 rows=134 width=8)
>> Filter: ((moment = '1979-05-02 11:45:00'::timestamp without time zone)
>> AND (height >= 12000::double precision) AND (height <= 14000::double
>> precision) AND (longitude >= 180::double precision) AND (longitude <=
>> 190::double precision) AND (latitude >= 0::double precision) AND (latitude
>> <= 10::double precision))
>>
>
> That plan expects 134 rows; how accurate is that? An explain analyse will
> show you that (and will take the 4 minutes in this case).
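>
> For example (reusing the BETWEEN form from above; the output then shows the
> estimated and the actual row counts side by side):
>
> EXPLAIN ANALYZE
> select trajectory from atmospherepoint where moment='1979-05-02 11:45:00'
> and height between 12000 and 14000 and longitude between 180 and 190 and
> latitude between 0 and 10;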
>
> I think you'll need lots of memory if you want to keep those indexes in
> memory. If I do a rough byte-count of the datatypes involved times the
> number of records I think just the index data (w/o metadata) takes up 40B *
> 140M rows = 5.6GB. Scanning an index on disk is likely to be slow, which may
> well be one of the reasons PG decides on a sequential scan.
>
> What is the distribution of the data you're querying for? If it's all over
> the place then the planner made a pretty good guess that a sequential scan is
> the best option; random disk I/O is slow, whereas sequential I/O is usually
> much faster.
>
> Before solving that you'll want to figure out whether this is a typical
> query, or whether distributing the data differently will hurt other queries.
> If it's a typical query, then clustering your data on an index and/or
> partitioning will help. If it's not, it will still help this type of query
> but it may hurt other types of queries. You'll want to optimise in the right
> direction.
>
> As other people said already, partitioning is likely to help here. Your
> data seems very suitable for partitioning too. I think I'd divide the
> coordinates into a number of latitude/longitude slices (a grid seems suitable
> to me, but I don't deal with atmospheric data). With a 32x32 grid you'd have
> fewer than 500k records per slice on average (140M / 1024 is roughly 140k);
> whether that's sufficiently small you'll have to test.
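>
> A rough sketch of what a single slice could look like with inheritance-based
> partitioning (the table name, ranges and grid size are just illustrative):
>
> CREATE TABLE atmospherepoint_lon180_lat0 (
>     CHECK (longitude >= 180 AND longitude < 190
>            AND latitude >= 0 AND latitude < 10)
> ) INHERITS (atmospherepoint);
>
> With constraint_exclusion = on the planner can then skip slices whose CHECK
> constraint cannot match the query's WHERE clause.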
>
> Alternatively, if your data is clustered (on disk) by, say,
> longitude/latitude, it may already help to define partial indexes on
> longitude/latitude ranges, for example:
>
> CREATE INDEX atm_section1_idx ON atmospherepoint (longitude, latitude, height)
> WHERE longitude BETWEEN 180 AND 190 AND latitude BETWEEN 0 AND 10;
>
> This is very similar to partitioning your data, but it doesn't require
> splitting the data up across multiple partitions (tables). This could be
> faster if you have queries spanning multiple table partitions, but if your
> data isn't clustered appropriately, finding it on disk will require more
> random I/O than a partitioned layout.
> Choices choices... The only way to find out what works best is to test, I'm
> afraid.
>
> but I have created indices for every column in the table and I have also
>> run ANALYZE and VACUUM:
>>
>> Table "public.atmospherepoint2"
>>    Column   |            Type             | Modifiers
>> ------------+-----------------------------+-----------
>>  id         | integer                     | not null
>>  trajectory | integer                     | not null
>>  moment     | timestamp without time zone | not null
>>  longitude  | real                        | not null
>>  latitude   | real                        | not null
>>  height     | real                        | not null
>> Indexes:
>> "atmospherepoint2_pkey" PRIMARY KEY, btree (id)
>> "atm_height_idx" btree (height)
>> "atm_latitude_idx" btree (latitude)
>> "atm_longitude_idx" btree (longitude)
>> "atm_moment_idx" btree (moment)
>> "atm_trajectory_idx" btree (trajectory)
>> Foreign-key constraints:
>> "atmospherepoint2_trajectory_fkey" FOREIGN KEY (trajectory) REFERENCES
>> trajectory2(id)
>>
>
> You do realise that floating point arithmetic isn't exact, don't you? If
> that matters to you, you're probably better off using the numeric type,
> although FP arithmetic is apparently faster.
>
> That aside, you're probably better off with a postgis type for the
> co-ordinates, as that's designed for this kind of data.
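>
> If you go that route, a minimal sketch would be something like the following
> (assuming postgis is installed and SRID 4326 suits your data; the column and
> index names are just examples):
>
> SELECT AddGeometryColumn('public', 'atmospherepoint', 'location', 4326, 'POINT', 2);
> UPDATE atmospherepoint
>    SET location = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
> CREATE INDEX atm_location_gist_idx ON atmospherepoint USING gist (location);
>
> Bounding-box searches can then use the && operator, which is supported by the
> GiST index.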
>
> I will very much welcome any advice or pointers on how to tune these tables
>> to work faster.
>> The tables don't change at all once the data has been loaded; they are only
>> used for queries.
>> Thank you very much.
>>
>
>
>
> Alban Hertroys
>
> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.
>
>
