<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<br>
On 12/8/2011 7:05 AM, Michael Lush wrote:
<blockquote
cite="mid:CACXX7MdoDdACfJMfhnugNoGxAhe-n5kxr716tGt6iUZ1n4ZKyQ(at)mail(dot)gmail(dot)com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
I have dataset with ~10000 columns and ~200000 rows (GWAS data
(1)) in the form<br>
<br>
sample1, A T, A A, G C, ....<br>
sampel2, A C, C T, A A, ....<br>
<br>
I'd like to take subsets of both columns and rows for analysis<br>
<br>
Two approaches spring to mind either unpack it into something like
an RDF triple<br>
<br>
ie <br>
CREATE TABLE long_table (<br>
sample_id varchar(20),<br>
column_number int,<br>
snp_data varchar(3));<br>
<br>
for a table with 20 billion rows<br>
<br>
or use the array datatype<br>
<br>
CREATE TABLE wide_table (<br>
sample_id,<br>
snp_data[]);<br>
<br>
Does anyone have any experience of this sort of thing?<br>
<br>
(1) <a moz-do-not-send="true"
href="http://en.wikipedia.org/wiki/Genome-wide_association_study"
target="_blank">http://en.wikipedia.org/wiki/Genome-wide_association_study</a><br>
<br>
--<br>
Michael Lush<br>
</blockquote>
<br>
I store all my genotype data similar to this. <br>
<br>
<tt>CREATE TABLE gen1000<br>
(<br>
-- Inherited from table genotypes: snp_number integer NOT NULL,<br>
-- Inherited from table genotypes: sample_id integer NOT NULL,<br>
-- Inherited from table genotypes: genotype smallint NOT NULL,<br>
CONSTRAINT check1000 CHECK (snp_number < 58337 AND sample_id
> 100000000 AND sample_id < 101000000)<br>
)<br>
INHERITS (genotypes)<br>
WITH (<br>
OIDS=FALSE<br>
);<br>
ALTER TABLE gen1000<br>
OWNER TO postgres;<br>
COMMENT ON TABLE gen1000<br>
IS '100 ANG';<br>
<br>
-- Index: xgen1000_sample_id<br>
CREATE INDEX xgen1000_sample_id<br>
ON gen1000<br>
USING btree<br>
(sample_id )<br>
WITH (FILLFACTOR=100)<br>
TABLESPACE index_tablespace;<br>
ALTER TABLE gen1000 CLUSTER ON xgen1000_sample_id;<br>
<br>
-- Index: xgen1000_snp_number<br>
CREATE INDEX xgen1000_snp_number<br>
ON gen1000<br>
USING btree<br>
(snp_number )<br>
WITH (FILLFACTOR=100)<br>
TABLESPACE index_tablespace;</tt><br>
<br>
My implementation is basically a data warehouse where I am the only
person with access to the db.<br>
<br>
There are several benefits to storing your genotypes this way and
there are benefits to storing your genotypes in more of a matrix
format. If you store them as your "long" table you really need to
think about whether or not you can partition your data. In my case
I'm able to partition by sample_id because all of my animals from a
given breed (population, race, location etc) have IDs in a certain
range. Furthermore, I'm able to create partitions based on which
assay (Illumina SNP50, Illumina HD, AFFX BOS-1) the genotypes came
from because my snp_numbers (rather than marker names) are integer
and specifically structured so ranges of integers correspond to an
assay. Therefore, a given genotype table is only as long as the
number of animals X number of markers on the assay. I have one
partition that would be very large (20000 animals X 60000 markers =
1.2B) but what I did was further split it up into sub-partitions of
~220M.<br>
<br>
I have separate tables for marker information and sample information
which are keyed on snp_number and sample_id. Given the structure
and appropriate indexes, and check constraints I'm able to grab
specific "chunks" of data fairly easily.<br>
<br>
As a side note, I take every opportunity I get to advise people not
to store GWAS data in base format (AGCT). If you are using Illumina
data at the very least store data and work with it in A/B format.
Ditto for Affymetrix. I actually convert the A/B format to my own
coding and store genotypes as a single small integer (genotype
above) where AA=1, BB=2, AB=3, missing=10. At some point I'm going
to change this to missing =0. Here's why this is important. With
this coding you are able to store the genotype of an individual
using 2 bytes. With this coding you can also store phase
information if you have it... B/A=4, A/null=5, B/null =6, null/A=7,
null/B=8, null/null=9. One other side benefit, if you are wanting
to check parent/child inheritance all you need to do is find all the
loci where parent_genotype + child_genotype = 3 and count the number
of loci. Those are the only informative genotypes. If you're
working in base space it's more complicated. Furthermore, you
could take this data in columnar form and dump it and rewrite it as
a string (wide) where each value is an individual genotype, which is
about as compact as you can get it. You just need to write your
analysis software to decode it.<br>
<br>
Bob<br>
<br>
<br>
</body>
</html>