From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Asher <asher(at)piceur(dot)co(dot)uk>, pgsql-general(at)postgresql(dot)org
Subject: Re: Best way to handle multi-billion row read-only table?
Date: 2010-02-10 07:52:39
Message-ID: dcc563d11002092352x491d777ap73e78bc9196226d1@mail.gmail.com
Lists: pgsql-general
On Tue, Feb 9, 2010 at 11:51 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Asher wrote:
>>
>> Once loaded into the database the data will never be deleted or modified
>> and will typically be accessed over a particular date range for a particular
>> channel (e.g. "sample_time >= X AND sample_time <= Y AND channel=Z"). A
>> typical query won't return more than a few million rows and speed is not
>> desperately important (as long as the time is measured in minutes rather
>> than hours).
>>
>> Are there any recommended ways to organise this? Should I partition my big
>> table into multiple smaller ones which will always fit in memory (this would
>> result in several hundreds or thousands of sub-tables)? Are there any ways
>> to keep the index size to a minimum? At the moment I have a few weeks of
>> data, about 180GB, loaded into a single table and indexed on sample_time and
>> channel and the index takes up 180GB too.
>
> One approach to consider is partitioning by sample_time and not even
> including the channel number in the index. You've got tiny records; there's
> going to be hundreds of channels of data on each data page pulled in, right?
> Why not minimize physical I/O by reducing the index size, just read that
> whole section of time into memory (they should be pretty closely clustered
> and therefore mostly sequential I/O), and then push the filtering by channel
> onto the CPU instead. If you've got billions of rows, you're going to end
> up disk bound anyway; minimizing physical I/O and random seeking around at
> the expense of CPU time could be a big win.
If they're loaded in a one-time bulk load, load them in channel order
into the partitions; the stats should then see the perfect ordering and
the planner should know to seq scan the right part of the table. I
think.
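Roughly, something like this (an untested sketch; the table and column
names are made up, and it assumes 8.x-style inheritance partitioning
with weekly children):

-- Parent table; hypothetical schema for the logged samples.
CREATE TABLE samples (
    sample_time timestamptz NOT NULL,
    channel     integer     NOT NULL,
    value       real        NOT NULL
);

-- One child per week; the CHECK constraint lets constraint exclusion
-- skip children whose range can't match the queried dates.
CREATE TABLE samples_2010w06 (
    CHECK (sample_time >= '2010-02-08' AND sample_time < '2010-02-15')
) INHERITS (samples);

-- Index sample_time only; channel filtering is pushed onto the CPU,
-- which keeps the index much smaller than indexing (sample_time, channel).
CREATE INDEX samples_2010w06_time_idx ON samples_2010w06 (sample_time);

-- Bulk load pre-sorted by (channel, sample_time) so each channel's rows
-- are physically clustered, then ANALYZE so the planner sees that.
-- COPY samples_2010w06 FROM '/path/to/week06.csv' WITH CSV;
ANALYZE samples_2010w06;

-- Typical query; constraint exclusion prunes to the matching children.
SET constraint_exclusion = on;
SELECT *
FROM samples
WHERE sample_time >= '2010-02-09'
  AND sample_time <  '2010-02-10'
  AND channel = 42;

With constraint_exclusion on, a range-plus-channel query like that
should touch only the matching children and do mostly sequential reads
within them.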