Re: TEXT column > 1Gb

From: Benedict Holland <benedict(dot)m(dot)holland(at)gmail(dot)com>
To: Joe Carlson <jwcarlson(at)lbl(dot)gov>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Rob Sargent <robjsargent(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: TEXT column > 1Gb
Date: 2023-04-12 17:24:34
Message-ID: CAD+mzowRf0w=PST21DZQn8p_=Yzu94WbC3sncBNYS1r5rF=n0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

For documents that long I would seriously consider using large objects and
refencing them with their OIDs. Text fields get put in a special location
within the database. It's similar (possibly exactly) to using large
objects. Also, you can potentially compress them to save space on write and
read. 1gb of text is a lot of text.

See https://www.postgresql.org/docs/current/largeobjects.html

Thanks,
Ben

On Wed, Apr 12, 2023, 1:20 PM Joe Carlson <jwcarlson(at)lbl(dot)gov> wrote:

> I’ve certainly thought about using a different representation. A factor of
> 2x would be good, for a while anyway. For nucleotide sequence, we’d need to
> consider a 10 character alphabet (A, C, G, T, N and the lower case forms
> when representing ’soft masked’ sequence*). So it would be 2 bases/byte.
> (Proteins are not nearly so long so a straight storage is simpler.) But
> these would be bigger changes on the client side than storing in chunks so
> I think this is the way to go.
>
> We’re working with plant genomes, which compared to human chromosomes, are
> HUGE. One chromosome of fava bean is over a gig. And pine tree is another
> monster. This, together with the fact that sequence data collection and
> assembly have improved so much in the past couple years has forced us to
> rethink a lot of our data storage assumptions.
>
> * for those curious, especially in plants, much of sequence consists of
> repetitive element that are remnants of ancient viruses, simple repeats and
> the like. For people who want to identify particular functional components
> in a genome, they typically do not want to search against this sequence but
> restrict searching to coding regions. But the repetitive sequence is still
> important and we need to keep it.
>
> > On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
> wrote:
> >
> >
> >
> >> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarlson(at)lbl(dot)gov> wrote:
> >>
> >> The use case is genomics. Extracting substrings is common. So going to
> chunked storage makes sense.
> >
> > Are you storing nucleotide sequences as text strings? If using the
> simple 4-character (A,C,G,T) alphabet, you can store four bases per byte.
> If using a nucleotide code 16-character alphabet you can still get two
> bases per byte. An amino acid 20-character alphabet can be stored 8 bases
> per 5 bytes, and so forth. Such a representation might allow you to store
> sequences two or four times longer than the limit you currently hit, but
> then you are still at an impasse. Would a factor or 2x or 4x be enough for
> your needs?
> >
> > —
> > Mark Dilger
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >
> >
> >
>
>
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Rob Sargent 2023-04-12 17:37:18 Re: TEXT column > 1Gb
Previous Message Joe Carlson 2023-04-12 17:19:57 Re: TEXT column > 1Gb