Re: more than 2GB data string save

From: Peter Hunsberger <peter(dot)hunsberger(at)gmail(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: more than 2GB data string save
Date: 2010-02-10 16:10:24
Message-ID: cc159a4a1002100810l55f5d3dal1634abf0ac284972@mail.gmail.com
Lists: pgsql-general

On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
>
> On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins <steve(at)blighty(dot)com> wrote:
> > A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database it's way bigger than the underlying index types tsquery uses are designed for.
> >
> > Are you sure that the documents are that big? A single document of that size would be 400 times the size of the Bible. That's a ridiculously large amount of text, most of a small library.
> >
> > If the answer is "yes, it's really that big and it's really text" then look at clucene or, better, hiring a specialist.
>
> I'm betting it's something like gene sequences or geological samples,
> or something other than straight text.  But even those bear breaking
> down into some kind of simple normalization scheme, don't they?
>

A single genome is ~1.3GB as chars, or half that size if you use 4 bits
per nucleotide (which should work for at least 90% of use cases). The
simplest design is to store a single reference sequence and then store
everything else as deltas from it. On average that should require about
3-5% of the size of your reference sequence per comparative sample (not
counting FKs and indexes).
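
To make the two ideas above concrete, here is a minimal Python sketch, not the actual scheme we use. It assumes a 4-bit code in which each of A, C, G and T gets one bit, so IUPAC ambiguity codes fall out as bit unions, and it assumes deltas are substitution-only between sequences of equal length (real data would also need insertions and deletions). The names pack, unpack, delta and apply_delta are made up for illustration.

```python
# Index in this string is the 4-bit code: A=1, C=2, G=4, T=8, and every
# other IUPAC letter is the bitwise OR of the bases it can stand for.
# "-" (code 0) doubles as the gap symbol and as padding.
ALPHABET = "-ACMGRSVTWYKHDBN"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def pack(seq: str) -> bytes:
    """Pack a nucleotide string into bytes, two bases per byte."""
    codes = [CODE[ch] for ch in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # pad odd-length input with a gap code
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack(); length is the original base count (drops padding)."""
    out = []
    for byte in data:
        out.append(ALPHABET[byte >> 4])
        out.append(ALPHABET[byte & 0x0F])
    return "".join(out[:length])


def delta(reference: str, sample: str) -> list:
    """Return (position, base) pairs where sample differs from reference.

    Assumption: substitutions only, so both sequences have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_delta(reference: str, changes: list) -> str:
    """Rebuild a sample from the reference plus its substitution list."""
    bases = list(reference)
    for pos, base in changes:
        bases[pos] = base
    return "".join(bases)
```

In a database, the packed reference would go into a single bytea column and each (position, base) pair into a row keyed by sample; that mapping is only a suggestion.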

As I mentioned on the list a couple of months ago, we are in the middle
of stuffing a bunch of molecular data (including entire genomes) into
Postgres. If anyone else is doing this, I would welcome the
opportunity to discuss the issues off-list.

--
Peter Hunsberger
