From: | Benedict Holland <benedict(dot)m(dot)holland(at)gmail(dot)com> |
---|---|
To: | Joe Carlson <jwcarlson(at)lbl(dot)gov> |
Cc: | Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Rob Sargent <robjsargent(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: TEXT column > 1Gb |
Date: | 2023-04-12 17:24:34 |
Message-ID: | CAD+mzowRf0w=PST21DZQn8p_=Yzu94WbC3sncBNYS1r5rF=n0A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
For documents that long I would seriously consider using large objects and
refencing them with their OIDs. Text fields get put in a special location
within the database. It's similar (possibly exactly) to using large
objects. Also, you can potentially compress them to save space on write and
read. 1gb of text is a lot of text.
See https://www.postgresql.org/docs/current/largeobjects.html
Thanks,
Ben
On Wed, Apr 12, 2023, 1:20 PM Joe Carlson <jwcarlson(at)lbl(dot)gov> wrote:
> I’ve certainly thought about using a different representation. A factor of
> 2x would be good, for a while anyway. For nucleotide sequence, we’d need to
> consider a 10 character alphabet (A, C, G, T, N and the lower case forms
> when representing ’soft masked’ sequence*). So it would be 2 bases/byte.
> (Proteins are not nearly so long so a straight storage is simpler.) But
> these would be bigger changes on the client side than storing in chunks so
> I think this is the way to go.
>
> We’re working with plant genomes, which compared to human chromosomes, are
> HUGE. One chromosome of fava bean is over a gig. And pine tree is another
> monster. This, together with the fact that sequence data collection and
> assembly have improved so much in the past couple years has forced us to
> rethink a lot of our data storage assumptions.
>
> * for those curious, especially in plants, much of sequence consists of
> repetitive element that are remnants of ancient viruses, simple repeats and
> the like. For people who want to identify particular functional components
> in a genome, they typically do not want to search against this sequence but
> restrict searching to coding regions. But the repetitive sequence is still
> important and we need to keep it.
>
> > On Apr 12, 2023, at 10:04 AM, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
> wrote:
> >
> >
> >
> >> On Apr 12, 2023, at 7:59 AM, Joe Carlson <jwcarlson(at)lbl(dot)gov> wrote:
> >>
> >> The use case is genomics. Extracting substrings is common. So going to
> chunked storage makes sense.
> >
> > Are you storing nucleotide sequences as text strings? If using the
> simple 4-character (A,C,G,T) alphabet, you can store four bases per byte.
> If using a nucleotide code 16-character alphabet you can still get two
> bases per byte. An amino acid 20-character alphabet can be stored 8 bases
> per 5 bytes, and so forth. Such a representation might allow you to store
> sequences two or four times longer than the limit you currently hit, but
> then you are still at an impasse. Would a factor or 2x or 4x be enough for
> your needs?
> >
> > —
> > Mark Dilger
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >
> >
> >
>
>
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Rob Sargent | 2023-04-12 17:37:18 | Re: TEXT column > 1Gb |
Previous Message | Joe Carlson | 2023-04-12 17:19:57 | Re: TEXT column > 1Gb |