Re: Best way to handle multi-billion row read-only table?

From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Asher <asher(at)piceur(dot)co(dot)uk>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Best way to handle multi-billion row read-only table?
Date: 2010-02-09 21:53:36
Message-ID: dcc563d11002091353ha05c637re12384224aa89f55@mail.gmail.com
Lists: pgsql-general

On Tue, Feb 9, 2010 at 10:47 AM, Asher <asher(at)piceur(dot)co(dot)uk> wrote:
> Hello.
>
> I'm putting together a database to store the readings from various
> measurement devices for later processing. Since these things (water pressure
> monitors attached to very large water pipes) take readings at 200Hz and are
> typically deployed over multiple sites for several months at a time I've got
> many billions of rows of data, each (at the moment) with the following
> simple format:
>
>        value REAL NOT NULL,
>        sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
>        channel INTEGER REFERENCES channel(id) NOT NULL
>
> (Where the "channel" table contains metadata to identify the particular
> sensor, data logger, etc. used to obtain the data and the combination of
> channel and sample_time is unique.)
>
> Once loaded into the database the data will never be deleted or modified and
> will typically be accessed over a particular date range for a particular
> channel (e.g. "sample_time >= X AND sample_time <= Y AND channel=Z"). A
> typical query won't return more than a few million rows and speed is not
> desperately important (as long as the time is measured in minutes rather
> than hours).
>
> Are there any recommended ways to organise this? Should I partition my big
> table into multiple smaller ones which will always fit in memory (this would
> result in several hundreds or thousands of sub-tables)?

Partitioning is the standard way. Note that you can partition on one
axis, or two, or more. In this case, partitioning on time and channel
might make the most sense.
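
If you're on 8.3/8.4 that means inheritance partitioning with CHECK
constraints. Something along these lines (untested, and the parent
table name "sample" and the particular ranges are just placeholders
for whatever fits your data) is the general shape:

    -- Parent table holds no rows itself; all data lives in the children.
    CREATE TABLE sample (
        value       REAL NOT NULL,
        sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
        channel     INTEGER NOT NULL REFERENCES channel(id)
    );

    -- One child per channel per month, with CHECK constraints that let
    -- the planner skip partitions that can't match a query.
    CREATE TABLE sample_ch1_2010_02 (
        CHECK (channel = 1
               AND sample_time >= '2010-02-01'
               AND sample_time <  '2010-03-01')
    ) INHERITS (sample);

    CREATE INDEX sample_ch1_2010_02_time_idx
        ON sample_ch1_2010_02 (sample_time);

You'd either INSERT directly into the right child when loading, or put
a trigger/rule on the parent to route rows for you.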

> Are there any ways
> to keep the index size to a minimum? At the moment I have a few weeks of
> data, about 180GB, loaded into a single table and indexed on sample_time and
> channel and the index takes up 180GB too.

It may be that with small enough partitions, indexes aren't really
needed. That's been the case for a lot of the data I've worked with in
the past.
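
With constraint_exclusion turned on (it defaults to 'partition' in
8.4) the planner only touches the children whose CHECK constraints can
match, so a query like the one below ends up as a sequential scan of
one smallish partition, which is often fast enough without any index
at all. (Again, table and column names are just the placeholders from
the sketch above.)

    SET constraint_exclusion = on;

    SELECT value, sample_time
    FROM   sample
    WHERE  channel = 1
      AND  sample_time >= '2010-02-01'
      AND  sample_time <  '2010-02-08';

Run it under EXPLAIN first to confirm the other partitions really are
being excluded.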

> Since this is all for a typically budget-restricted PhD project the hardware
> is just a high-end desktop workstation with (at the moment) 2*2TB drives
> organised into a single 4TB partition using FreeBSD's vinum system.

Partitioning should definitely help. You might want to go with RAID-1
instead of RAID-0, since the read performance is similar under most
modern OSes. I know Linux now spreads reads across both drives in a
mirror, and I'd assume BSD does too. That way you've got better
reliability and about the same read performance. Load times will be
about double since every write has to hit both drives, but that's a
one-time thing, right?
