From: | Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> |
---|---|
To: | Michael Heaney <mheaney(at)jcvi(dot)org>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Advice for using integer arrays? |
Date: | 2015-01-07 00:19:23 |
Message-ID: | 54AC7B8B.6060708@aklaver.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 01/06/2015 01:18 PM, Michael Heaney wrote:
> On 1/6/2015 2:19 PM, Jeff Janes wrote:
>> On Tue, Jan 6, 2015 at 9:09 AM, Michael Heaney <mheaney(at)jcvi(dot)org
>> <mailto:mheaney(at)jcvi(dot)org>> wrote:
>>
>> I'm fairly new to Postgres, and have a design issue for which an
>> array of integers might be a good solution. But I'd like to hear
>> from the experts before proceeding down this path.
>>
>> Essentially, I'm trying to model the relationship between a group
>> of biological samples and their genes. Each sample (there are ~10K
>> of them now, with more coming) will have about 30,000 genes.
>> Conversely, a particular gene may be present in almost all samples.
>>
>> So I've created the following table to handle the many-to-many
>> relationship:
>>
>> create table sample_gene (id serial, sample_id int, gene_id int);
>>
>>
>> What is the value of having this table at all? It doesn't seem to
>> contain anything informative, like an allele identifier, a resequence,
>> or a copy number variation. If you are just trying to record the fact
>> that a gene was present in that sample, perhaps it would be better to
>> instead record the genes have been deleted, rather than the ones that
>> have not been deleted? That would probably be a much smaller list.
>>
>
> I suppose there could be a gene table which would contain data about
> each gene_id.
> But I'm an IT guy, not a biologist, and my sample_gene table doesn't
> actually
> exist.
Alright, now I am confused. In your original post you say sample_gene
does exist and you joined it against genes(I assume a gene table) which
is when you encountered slow performance. Did I miss something:)?
Per a previous suggestion, it might be good to show the queries you are
using or plan to use. The issue may be in the query not the layout. Also
running the query with EXPLAIN ANALYZE would be helpful:
http://www.postgresql.org/docs/9.3/interactive/sql-explain.html
I'm more concerned with how to deal with many-to-many relationships
> when each parent could have tens of thousands of children. Collapsing all
> the children into an array for each parent looked intriguing - but maybe
> it's
> not a good idea. I just don't know, so I thought I'd ask you guys.
>
>
>
>> create table sample_gene_array (id serial, sample_id int, gene_id
>> int [] );
>>
>> So now the table data looks like this:
>>
>> sample_id | gene_id []
>> ---------------------------------------
>> 1 | [1:30475]
>> 2 | [1:29973]
>> etc.
>>
>>
>> I'm not familiar with the square bracket and colon as a syntax for
>> expressing int arrays. Are you taking liberties with the psql output,
>> or using a different client program? Does that represent the range
>> from 1 to 30475, or the two values 1 and 30475?
>
> Yes, it's shorthand for the full range of values from 1 to 30475. Wasn't
> sure how
> to represent it,
Use Python? Looks like a 1 based slice notation.
and almost went with (1,2,3...30475). Apologies...
>
>
> Michael Heaney
> JCVI
>
>
>
--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
From | Date | Subject | |
---|---|---|---|
Next Message | Sameer Kumar | 2015-01-07 01:30:01 | Re: Does anyone user pg-pool II on real production ? Please help me. |
Previous Message | Paul Jungwirth | 2015-01-06 22:24:59 | Re: Advice for using integer arrays? |