From: | Arthur Silva <arthurprs(at)gmail(dot)com> |
---|---|
To: | Michael Heaney <mheaney(at)jcvi(dot)org> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Advice for using integer arrays? |
Date: | 2015-01-06 17:48:00 |
Message-ID: | CAO_YK0URbxmxFi+Ra_QTy6-ss9XSbrsoR2k8CBn=zGVvUWVfHg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Jan 6, 2015 3:12 PM, "Michael Heaney" <mheaney(at)jcvi(dot)org> wrote:
>
> I'm fairly new to Postgres, and have a design issue for which an array of
integers might be a good solution. But I'd like to hear from the experts
before proceeding down this path.
>
> Essentially, I'm trying to model the relationship between a group of
biological samples and their genes. Each sample (there are ~10K of them
now, with more coming) will have about 30,000 genes. Conversely, a
particular gene may be present in almost all samples.
>
> So I've created the following table to handle the many-to-many
relationship:
>
> create table sample_gene (id serial, sample_id int, gene_id int);
>
> which looks like this when populated:
>
> sample_id | gene_id
> ---------------------------------------
> 1 | 1
> 1 | 2
> ...
> 1 | 30475
> 2 | 1
> 2 | 2
> ...
> 2 | 29973
> 3 | 1
> etc.
>
> The table now contains hundreds of millions of rows (with many, many more
to come). Join performance between samples and genes is quite slow, even
with indexes on sample_id and gene_id.
>
> So it occurred to me: why not eliminate all the duplicate sample_id
values by storing the gene_id's in an array, like so:
>
> create table sample_gene_array (id serial, sample_id int, gene_id int []
);
>
> So now the table data looks like this:
>
> sample_id | gene_id []
> ---------------------------------------
> 1 | [1:30475]
> 2 | [1:29973]
> etc.
>
> The new table is significantly smaller, and performance (using ANY[] ) is
quite good. Nevertheless, I'm uneasy. I come from a Sybase ASE
background, and so have no experience with arrays as datatypes. Is it okay
to store 30K+ gene values in an array in the linking table (or maybe even
in the sample table itself, thus eliminating the linking table)? Should I
unnest the gene_id's first, before using them to join to the gene table?
>
> TIA for any guidance you can provide. Again, I'm a Postgres neophyte -
but I'm in awe of the power and flexibility of this database, and wish that
I'd started using it sooner.
>
> ------
> Michael Heaney
> JCVI
>
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
Please provide sample queries so we can understand how you query the data.
From | Date | Subject | |
---|---|---|---|
Next Message | Rob Sargent | 2015-01-06 17:57:32 | Re: Advice for using integer arrays? |
Previous Message | Michael Heaney | 2015-01-06 17:09:56 | Advice for using integer arrays? |