From: | Joe Carlson <jwcarlson(at)lbl(dot)gov> |
---|---|
To: | Rob Sargent <robjsargent(at)gmail(dot)com> |
Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: TEXT column > 1Gb |
Date: | 2023-04-12 21:03:36 |
Message-ID: | 131DA24F-23D6-4EA1-816F-ED0E6E5A219D@lbl.gov |
Lists: | pgsql-general |
> On Apr 12, 2023, at 12:21 PM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:
>
> On 4/12/23 13:02, Ron wrote:
>> Must the genome all be in one big file, or can you store them one line per table row?
The assumption in the schema I’m using is 1 chromosome per record. Chromosomes are typically strings of continuous sequence (A, C, G, or T) separated by gaps (N) of approximately known or completely unknown size. In the past this has not been a problem, since sequenced chromosomes were maybe 100 megabases. But sequencing technology has improved and people are tackling more complex genomes, so gigabase chromosomes are now common.
A typical use case might be someone interested in seeing if they can identify the regulatory elements (the on or off switches) of a gene. The protein coding part of a gene can be predicted pretty reliably, but the upstream untranslated region and regulatory elements are tougher. So they might come to our web site and want to extract the 5 kb bit of sequence before the start of the gene and look for some of the common motifs that signify a protein binding site. We want to be able to pull out a substring of the genome quickly enough to drive a web app.
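To make that use case concrete, here is a rough Python sketch of the two steps: slice out the bases just upstream of a gene start, then scan that slice for a motif. The function names, the 1-based `gene_start` convention, and the forward-strand-only handling are all assumptions for illustration, not how our site actually does it. On the database side the same slice would presumably come from a `substr()` call on the sequence column.

```python
def upstream_region(chromosome: str, gene_start: int, length: int = 5000) -> str:
    """Return up to `length` bases immediately before a forward-strand gene.

    `gene_start` is assumed to be 1-based (the same convention as SQL substr()).
    If the gene sits closer than `length` bases to the start of the chromosome,
    the result is simply shorter.
    """
    end = gene_start - 1              # 0-based index of the first gene base
    begin = max(0, end - length)      # clamp at the start of the chromosome
    return chromosome[begin:end]


def find_motif(sequence: str, motif: str) -> list[int]:
    """Return the 1-based position of every occurrence of `motif`,
    including overlapping ones."""
    positions = []
    i = sequence.find(motif)
    while i != -1:
        positions.append(i + 1)
        i = sequence.find(motif, i + 1)
    return positions


if __name__ == "__main__":
    # Toy chromosome: 12 bases of filler, a TATA-like motif, 4 more bases,
    # then a gene starting at 1-based position 23.
    chrom = "ACGT" * 3 + "TATAAA" + "GGGG" + "ATGCCC"
    region = upstream_region(chrom, gene_start=23, length=10)
    print(region, find_motif(region, "TATA"))
```

A real version would also need to handle genes on the reverse strand (reverse-complementing the downstream slice) and degenerate motifs, which this sketch ignores.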
>
> Not sure what OP is doing with plant genomes (other than some genomics) but the tools all use files and pipelines of sub-tools. In and out of tuples would be expensive. Very, very little "editing" done in the usual "update table set val where id" sense.
Yeah, it’s basically a warehouse. Stick data in, but then make all the connections between the functional elements, their products, and the predictions on the products. It’s definitely more than a document store, and we require a relational database.
>
> Lines in a vcf file can have thousands of columns of nasty, cryptic garbage data that only really makes sense to tools, reader. Highly denormalized of course. (Btw, I hate sequencing :) )
Imagine a discipline where some beleaguered grad student has to get something out the door by the end of the term. It gets published and the rest of the community says GREAT! We have a standard! Then the abuse of the standard happens. People who specialize in bioinformatics know just enough computer science, statistics and molecular biology to annoy experts in three different fields.