Re: Designing a DB for storing biological data

From: Rich Shepard <rshepard(at)appl-ecosys(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Designing a DB for storing biological data
Date: 2014-06-14 18:24:58
Message-ID: alpine.LNX.2.11.1406141107100.17243@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sat, 14 Jun 2014, Damir Dezeljin wrote:

> This is more a theoretical or better to say, conceptual question; still, I
> hope to get some feed backs from you folks. Additionally this is going to
> be a very long post :) off-topic: I asked a similar question on the MySQL
> forum as I'm still undecided if going with PostgreSQL or MySQL << I'm
> tempted at PostGIS.

Postgres. You can link it to spatial attributes with postgis if you want.
But, how to structure your database and tables is independent of the
analytical methods you use.

> I am designing a database for storing various biological and ecological
> data. Although there is no clear dividing line, it is possible to group
> the data into two groups, namely the measured (physical) and quantitative
> (mostly biological) data; I uploaded both a data sample and an initial
> draft of a DB model to this link
> <https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.

Of course there's a clear division between biological, chemical, physical,
geomorphical, and other data.

I may be wrong, but this looks like a question from a student for a
project or master's degree.

> *Biological / quantitative data*

You have many choices: EPA's STORET or EMAP projects, ITIS (International
Taxonomic Identification System), or your own. Each type of biological data
should have its own table; e.g., vegegation, mammals, birds, fish, benthic
macroinvertebrates, microbes.

Each table should have a compound key: site_id, sample_date, species. This
makes each row unique, when possible. Otherwise. use a sequential key.

The table I use for fish data has this schema:

Table "public.fish"
Column | Type | Modifiers
-------------+-----------------------+-----------
site | character varying(12) | not null
sampdate | date | not null
tclass | character(13) | not null
torder | character varying(16) | not null
tfamily | character varying(12) | not null
tgenus | character varying(12) | not null
tspecies | character varying(12) | not null
tsubspecies | character varying(12) |
common_name | character varying(32) |
quant | integer | not null
comments | text |
basin | character varying(10) |
stream | character varying(20) |
Indexes:
"fish_pkey" PRIMARY KEY, btree (site, sampdate, tspecies)

The table I use for benthos has this schema:

Table "public.benthos"
Column | Type | Modifiers

---------------+-----------------------+-----------------------------------
------------------------
sampid | integer | not null default nextval('macroinv
_sampid_seq'::regclass)
site | character varying(12) | not null
sampdate | date | not null
tclass | character varying(20) | not null
torder | character varying(32) | not null
tfamily | character varying(50) |
tgenus | character varying(32) |
tspecies | character varying(20) |
func_feed_grp | character varying(32) |
quant | integer | not null
comments | text |
stream | character varying(20) |
basin | character varying(10) |
Indexes:
"macroinv_pkey" PRIMARY KEY, btree (sampid)

> *Measured / physical data*

> - Physical quantities may be outside the detection range of the measured
> instrument; in such a case, this needs to be recorded. I still do not have
> a clear idea how to do it. NULL’s do not seem to be a good choice to mark
> such data.

Here, too, you have choices. You can incorrectly record and analyze
chemical data below detection limits like EPA, states, and most regulatory
permit holders do, or you can do it correctly. This is a schema I use for
water quality data:

Table "public.surfchem"
Column | Type | Modifiers
----------+-----------------------+-----------
site | character varying(20) | not null
sampdate | date | not null
param | character varying(32) | not null
quant | real |
ceneq1 | integer |
floor | real |
ceiling | real |
stream | character varying(20) |
basin | character varying(10) |
Indexes:
"surfchem_pkey" PRIMARY KEY, btree (site, sampdate, param)

The key is the set (site, sampdate, param) so you can have multiple
samples, each of a different chemical constituent, at the same location and
date. (This works for air, soil, and other media chemistry, too). The
concentrations must be in the same units (e.g., mg/L). The column named
'ceneq1' is a flag: when the quant value is below the analytical laboratory's
reporting limit, set the flag to '1'; otherwise 0. Now you have a consistent
way to identify rows with less-than/non-detect/censored values. Since
reporting levels change as instruments become more sensitive and analysts
become better trained, you can have multiple reporting limits in your data.
That's where the 'floor' and 'ceiling' columns come in.

This should get you started.

Rich

--
Richard B. Shepard, Ph.D.
Applied Ecosystem Services, Inc. | Troutdale, OR 97060 USA
www.appl-ecosys.com Voice: 503-667-4517 Fax: 503-667-8863

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Rich Shepard 2014-06-14 19:03:11 Re: Designing a DB for storing biological data
Previous Message Kevin Grittner 2014-06-14 18:15:57 Re: Using pg_start_backup() and pg_stop_backup() - using 9.1.2.2