Re: Advice for using integer arrays?

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Advice for using integer arrays?
Date: 2015-01-06 17:57:32
Message-ID: 54AC220C.1000004@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 01/06/2015 10:09 AM, Michael Heaney wrote:
> I'm fairly new to Postgres, and have a design issue for which an array
> of integers might be a good solution. But I'd like to hear from the
> experts before proceeding down this path.
>
> Essentially, I'm trying to model the relationship between a group of
> biological samples and their genes. Each sample (there are ~10K of
> them now, with more coming) will have about 30,000 genes. Conversely,
> a particular gene may be present in almost all samples.
>
> So I've created the following table to handle the many-to-many
> relationship:
>
> create table sample_gene (id serial, sample_id int, gene_id int);
>
> which looks like this when populated:
>
> sample_id | gene_id
> ---------------------------------------
> 1 | 1
> 1 | 2
> ...
> 1 | 30475
> 2 | 1
> 2 | 2
> ...
> 2 | 29973
> 3 | 1
> etc.
>
> The table now contains hundreds of millions of rows (with many, many
> more to come). Join performance between samples and genes is quite
> slow, even with indexes on sample_id and gene_id.
>
> So it occurred to me: why not eliminate all the duplicate sample_id
> values by storing the gene_id's in an array, like so:
>
> create table sample_gene_array (id serial, sample_id int, gene_id int
> [] );
>
> So now the table data looks like this:
>
> sample_id | gene_id []
> ---------------------------------------
> 1 | [1:30475]
> 2 | [1:29973]
> etc.
>
> The new table is significantly smaller, and performance (using ANY[] )
> is quite good. Nevertheless, I'm uneasy. I come from a Sybase ASE
> background, and so have no experience with arrays as datatypes. Is it
> okay to store 30K+ gene values in an array in the linking table (or
> maybe even in the sample table itself, thus eliminating the linking
> table)? Should I unnest the gene_id's first, before using them to
> join to the gene table?
>
> TIA for any guidance you can provide. Again, I'm a Postgres neophyte
> - but I'm in awe of the power and flexibility of this database, and
> wish that I'd started using it sooner.
>
> ------
> Michael Heaney
> JCVI
>
>
>
>
>
Think genotype calls (sample, marker, call) and this just explodes in
terms of number of rows but I'm trying to not let that bother me.
Partion by chromosome, or chromosome arm (even if partition is just
separate tables) and the magnitude becomes manageable. At least as
manageable as multi-gigabyte gvcf files per sample from GATK.

But I'm a little confused: all samples for a given species will have the
same genes (roughly). Are you storing gene variants (sequence or
otherwise) per sample?

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tim Smith 2015-01-06 18:02:27 Correct/optimal DML query for application session management ?
Previous Message Arthur Silva 2015-01-06 17:48:00 Re: Advice for using integer arrays?