Quick Links

Re: Big wide datasets

From:	"Robert D(dot) Schnabel" <schnabelr(at)missouri(dot)edu>
To:	Michael Lush <mjlush(at)gmail(dot)com>
Cc:	pgsql-novice(at)postgresql(dot)org
Subject:	Re: Big wide datasets
Date:	2011-12-08 13:39:16
Message-ID:	4EE0BE04.5020707@missouri.edu
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-novice

<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
 
 
On 12/8/2011 7:05 AM, Michael Lush wrote:
<blockquote
cite="mid:CACXX7MdoDdACfJMfhnugNoGxAhe-n5kxr716tGt6iUZ1n4ZKyQ(at)mail(dot)gmail(dot)com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
I have dataset with ~10000 columns and ~200000 rows (GWAS data
(1)) in the form 
 
sample1, A T, A A, G C, .... 
sampel2, A C, C T, A A, .... 
 
I'd like to take subsets of both columns and rows for analysis 
 
Two approaches spring to mind either unpack it into something like
an RDF triple 
 
ie 
CREATE TABLE long_table ( 
                               sample_id  varchar(20), 
                               column_number int, 
                               snp_data  varchar(3)); 
 
for a table with 20 billion rows 
 
or use the array datatype 
 
CREATE TABLE wide_table ( 
                                sample_id, 
                                snp_data[]); 
 
Does anyone have any experience of this sort of thing? 
 
(1) <a moz-do-not-send="true"
href="http://en.wikipedia.org/wiki/Genome-wide_association_study"
target="_blank">http://en.wikipedia.org/wiki/Genome-wide_association_study</a> 
 
-- 
Michael Lush 
</blockquote>
 
I store all my genotype data similar to this.  
 
<tt>CREATE TABLE gen1000 
( 
-- Inherited from table genotypes:  snp_number integer NOT NULL, 
-- Inherited from table genotypes:  sample_id integer NOT NULL, 
-- Inherited from table genotypes:  genotype smallint NOT NULL, 
  CONSTRAINT check1000 CHECK (snp_number < 58337 AND sample_id
> 100000000 AND sample_id < 101000000) 
) 
INHERITS (genotypes) 
WITH ( 
  OIDS=FALSE 
); 
ALTER TABLE gen1000 
  OWNER TO postgres; 
COMMENT ON TABLE gen1000 
  IS '100 ANG'; 
 
-- Index: xgen1000_sample_id 
CREATE INDEX xgen1000_sample_id 
  ON gen1000 
  USING btree 
  (sample_id ) 
  WITH (FILLFACTOR=100) 
TABLESPACE index_tablespace; 
ALTER TABLE gen1000 CLUSTER ON xgen1000_sample_id; 
 
-- Index: xgen1000_snp_number 
CREATE INDEX xgen1000_snp_number 
  ON gen1000 
  USING btree 
  (snp_number ) 
  WITH (FILLFACTOR=100) 
TABLESPACE index_tablespace;</tt> 
 
My implementation is basically a data warehouse where I am the only
person with access to the db. 
 
There are several benefits to storing your genotypes this way and
there are benefits to storing your genotypes in more of a matrix
format.  If you store them as your "long" table you really need to
think about whether or not you can partition your data.  In my case
I'm able to partition by sample_id because all of my animals from a
given breed (population, race, location etc) have IDs in a certain
range.  Furthermore, I'm able to create partitions based on which
assay (Illumina SNP50, Illumina HD, AFFX BOS-1) the genotypes came
from because my snp_numbers (rather than marker names) are integer
and specifically structured so ranges of integers correspond to an
assay.  Therefore, a given genotype table is only as long as the
number of animals X number of markers on the assay.  I have one
partition that would be very large (20000 animals X 60000 markers =
1.2B) but what I did was further split it up into sub-partitions of
~220M. 
 
I have separate tables for marker information and sample information
which are keyed on snp_number and sample_id.  Given the structure
and appropriate indexes, and check constraints I'm able to grab
specific "chunks" of data fairly easily. 
 
As a side note, I take every opportunity I get to advise people not
to store GWAS data in base format (AGCT).  If you are using Illumina
data at the very least store data and work with it in A/B format. 
Ditto for Affymetrix.  I actually convert the A/B format to my own
coding and store genotypes as a single small integer (genotype
above) where AA=1, BB=2, AB=3, missing=10.  At some point I'm going
to change this to missing =0.  Here's why this is important.  With
this coding you are able to store the genotype of an individual
using 2 bytes.  With this coding you can also store phase
information if you have it... B/A=4, A/null=5, B/null =6, null/A=7,
null/B=8, null/null=9.  One other side benefit, if you are wanting
to check parent/child inheritance all you need to do is find all the
loci where parent_genotype + child_genotype = 3 and count the number
of loci.  Those are the only informative genotypes.  If you're
working in base space it's more complicated.   Furthermore, you
could take this data in columnar form and dump it and rewrite it as
a string (wide) where each value is an individual genotype, which is
about as compact as you can get it.  You just need to write your
analysis software to decode it. 
 
Bob 
 
 
</body>
</html>

Attachment	Content-Type	Size
unknown_filename	text/html	6.4 KB

In response to

Big wide datasets at 2011-12-08 13:05:19 from Michael Lush

Browse pgsql-novice by date

	From	Date	Subject
Next Message	Gene Poole	2011-12-08 15:04:37	Re: [GENERAL] Convert / Migrate From Oracle 11gR2 To PostgreSQL ? On CentOS 5.7 x86_64
Previous Message	Jean-Yves F. Barbier	2011-12-08 13:24:44	Re: Big wide datasets