From: | "BillR" <iambill(at)williamrosmus(dot)com> |
---|---|
To: | <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Best way to handle multi-billion row read-only table? |
Date: | 2010-02-10 03:06:19 |
Message-ID: | 004901caa9fe$04c7c9f0$0e575dd0$@com |
Lists: pgsql-general
Is it also possible to denormalize by putting the 'channel' data in the
first table (especially if there isn't very much of it)? Maintaining a
foreign key constraint can impact performance significantly in most RDBMSs,
even with deferred checking. I could be wrong, but I suspect PostgreSQL is
no different. Alternatively, keep the data normalized and remove the
constraint altogether. Also remove any primary key constraint so the
database doesn't have to check uniqueness, and avoid as many indexes as you
can.
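For illustration only, a denormalized, constraint-free version of the
sample table quoted below might look something like this (the metadata
column names are my guesses, since the channel table isn't shown):

CREATE TABLE sample (
    value       REAL NOT NULL,
    sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
    -- channel metadata folded in: no foreign key, no primary key,
    -- no uniqueness check (column names assumed for the sketch)
    sensor      TEXT NOT NULL,
    logger      TEXT NOT NULL
);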
You have to take a leap of faith that you wrote your loading program well
enough that the data can't get out of sync.
I would be interested to hear comments on this. These are some of the
things we did on Oracle systems I have worked on that handled even higher
volumes (tens to hundreds of thousands of transactions per second or more,
with the lower end of that range sustained throughout the day). Granted, we
had very heavy hardware, but the DBAs forbade us from creating constraints,
indexes, and the like for exactly this reason, except on less active
tables.
Everyone has already talked about partitioning, but load balancing across
machines, if you can afford a couple or a few more, could help too. I'm not
sure what facilities Postgres has for this, though (I would be interested
to hear comments on this too! :-))
BillR
-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org
[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Asher
Sent: February-09-10 12:47 PM
To: pgsql-general(at)postgresql(dot)org
Subject: [GENERAL] Best way to handle multi-billion row read-only table?
Hello.
I'm putting together a database to store the readings from various
measurement devices for later processing. Since these things (water
pressure monitors attached to very large water pipes) take readings at
200Hz and are typically deployed over multiple sites for several months
at a time, I've got many billions of rows of data, each (at the moment)
with the following simple format:
CREATE TABLE sample (  -- table name assumed for illustration
    value       REAL NOT NULL,
    sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
    channel     INTEGER NOT NULL REFERENCES channel(id)
);
(Where the "channel" table contains metadata to identify the particular
sensor, data logger, etc. used to obtain the data and the combination of
channel and sample_time is unique.)
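The channel table itself isn't shown; a minimal sketch, with the column
names assumed purely for illustration, might be:

CREATE TABLE channel (
    id     INTEGER PRIMARY KEY,
    sensor TEXT NOT NULL,   -- assumed: which sensor took the readings
    logger TEXT NOT NULL    -- assumed: which data logger recorded them
);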
Once loaded into the database the data will never be deleted or modified
and will typically be accessed over a particular date range for a
particular channel (e.g. "sample_time >= X AND sample_time <= Y AND
channel=Z"). A typical query won't return more than a few million rows
and speed is not desperately important (as long as the time is measured
in minutes rather than hours).
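Given that access pattern, a single composite index with channel first
would let one index range scan serve the whole query. A sketch, using the
table name assumed above and hypothetical dates:

CREATE UNIQUE INDEX sample_channel_time_idx
    ON sample (channel, sample_time);  -- also enforces the uniqueness noted above

SELECT value, sample_time
  FROM sample
 WHERE channel = 42
   AND sample_time >= '2010-01-01'
   AND sample_time <= '2010-01-31';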
Are there any recommended ways to organise this? Should I partition my
big table into multiple smaller ones which will always fit in memory
(this would result in several hundreds or thousands of sub-tables)? Are
there any ways to keep the index size to a minimum? At the moment I have
a few weeks of data, about 180GB, loaded into a single table and indexed
on sample_time and channel, and the index takes up 180GB too.
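For reference, partitioning in PostgreSQL at the time of writing is done
with table inheritance plus CHECK constraints, so constraint exclusion lets
the planner skip irrelevant children. A rough sketch, assuming the table
names above and a monthly split:

CREATE TABLE sample_2010_01 (
    CHECK (sample_time >= '2010-01-01' AND sample_time < '2010-02-01')
) INHERITS (sample);

CREATE INDEX sample_2010_01_idx
    ON sample_2010_01 (channel, sample_time);

-- With constraint exclusion enabled, a query constrained on
-- sample_time only scans the matching child tables:
SET constraint_exclusion = on;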
Since this is all for a typically budget-restricted PhD project, the
hardware is just a high-end desktop workstation with (at the moment)
2*2TB drives organised into a single 4TB partition using FreeBSD's vinum
system.
Many thanks for any help,
Asher.