Designing a DB for storing biological data

From: Damir Dezeljin <damir(dot)dezeljin(at)dezo(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Designing a DB for storing biological data
Date: 2014-06-14 15:52:43
Message-ID: CAM6QOa-2iAEMia1rKfjRtY31SjO2XM5=bkaHBb_MyCkFNCWAAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello.

This is more a theoretical or better to say, conceptual question; still, I
hope to get some feed backs from you folks. Additionally this is going to
be a very long post :) off-topic: I asked a similar question on the MySQL
forum as I'm still undecided if going with PostgreSQL or MySQL << I'm
tempted at PostGIS.

I am designing a database for storing various biological and ecological
data. Although there is no clear dividing line, it is possible to group the
data into two groups, namely the measured (physical) and quantitative
(mostly biological) data; I uploaded both a data sample and an initial
draft of a DB model to this link
<https://www.dropbox.com/sh/9gm2ezwrwhkz6xv/AAB3koD6Xzi48-2BhIEdwmlZa>.

From the mentioned sample, it is evident the following difference between
the two:
*Biological / quantitative data*

- The data are actually numbers of occurrences of a specific type of
items, namely animal and plant spices. The counting is done by following a
predefined method as e.g. number of samples per 100 m^2.
- One sampling is
- A sampling consist of counting multiple species on a single day,
predefined location, by following a predefined method. Please note the
counting may repeat multiple time for a single species using the same or a
different method.
- A typical number of different species counted per sampling is
something between 15 and 100.
- Data are mostly quantitative, which means consisting mostly of
integers numbers; however, this does not apply to all cases.

*Measured / physical data*

- This data comprise from e.g. a set of measured physical quantities
such as temperature, salinity, DI, etc. (usually up to 15 or 20
quantities). These measurements are performed on samples of waters taken
from different depths at a predefined location on a predefined date and
time. Although the samples of water from different depth on a single
location are taken a couple of minutes apart one from another, it would
help tracking them as a single profile, which basically consists of data of
analyzed samples from a single location at a specific time.
- Most data are decimal numbers of certain precision - e.g. if the
instrument provides accurate information to the first decimal place, it has
to be stored with precision up to the first decimal place. Contrary, the
salinity from the mentioned example available at the link above is measured
accurately to the third decimal place, so it makes sense to store it and
make it possible to retrieve the number accurate to the third decimal place.
I was also considering storing depth as a NUMERIC to avoid inexactness
when dealing with REAL or DOUBLE -> from MySQL I have a concern two FLOAT-s
(REAL in PostgreSQL) being 3.4 can't be compared in a quely like value1 =
value2 -> e.g. "... WHERE depth = 3.4;".
Am I missing something or is there a better solution how to address such
cases?

*General notes*

- Physical quantities may be outside the detection range of the measured
instrument; in such a case, this needs to be recorded. I still do not have
a clear idea how to do it. NULL’s do not seem to be a good choice to mark
such data.
- Different quantities are measured with different precision - e.g.
counted quantities don’t have decimal places; some instruments report data
with 1 decimal digit precision, other with 2, etc.
- The only quantities that are always present with all data recorded are
the depth where the sample was taken.
- I use RESTful interface as a mean layer between the DB and the GUI.

*Finally, here is my dilemma*
I am somewhat undecided what is the best way to implement the database and
consequently what kind of queries to use. At above link a database model I
am currently working on can be found . Looking to the diagram it becomes
evident I am deciding if storing every measurement / determinant / depth
triple as a separate record. The biggest dilemma I have is a query for a
simple sample of pressure, temperature, salinity and oxygen would imply
multiple joins. As far as I know, this will badly affect the performance;
as well, it will harden codding the RESTful interface.

The other option I considered and I did not discard yet is adopting tables
to specific needs. In such case storing data from a CTD (Conductivity /
Temperature / Depth) probe would result in a table row containing: depth,
conductivity, salinity, temperature, depth. Such approach rather makes
sense; however, in such a case I’ll end up with tons of tables that
sometime in future may be extended with additional columns.

I would appreciate any advice and hint I receive.

Thanks and best regards,
Damir

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2014-06-14 17:52:37 Re: Designing a DB for storing biological data
Previous Message Khangelani Gama 2014-06-14 13:44:38 Using pg_start_backup() and pg_stop_backup() - using 9.1.2.2