Re: Primary Key

From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Primary Key
Date: 2007-11-18 17:51:23
Message-ID: 20071118175123.GD1955@frubble.xen.chris-lamb.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sun, Nov 18, 2007 at 04:19:01AM +0000, Ron Johnson wrote:
> If the PK was synthetic and generated by the engine, then a (buggy)
> app could insert duplicate tolls and the system wouldn't utter a
> peep. But the customer sure would when he saw the duplicate entries.

You'd just need to put a UNIQUE constraint on an appropriate set of
columns and the database would complain in all the same places.

One big benefit that I've been able to discern with having natural,
over synthetic, keys is that you need to pull in every table up the
hierarchy when you want to find out something lower down. An example
would probably help; say we were recording samples from locations, with
these samples being collected in groups on specific visits. I've been
using synthetic keys more recently, so I'll start with them:

CREATE TABLE locations (
id SERIAL NOT NULL PRIMARY KEY,
locntype TEXT
);
CREATE TABLE visits (
id SERIAL PRIMARY KEY,
locnid INTEGER NOT NULL REFERENCES locations,
visitnum INTEGER NOT NULL,
UNIQUE (locnid,visitnum),
visitdate DATE
);
CREATE TABLE samples (
id SERIAL PRIMARY KEY,
visitid INTEGER NOT NULL REFERENCES visits,
samplenum INTEGER NOT NULL,
UNIQUE (visitid,samplenum),
barcode TEXT UNIQUE
);

Say I wanted to summarise the number of samples per location type, I'd
be forced to join onto the visits table; like this:

SELECT l.locntype, COUNT(*)
FROM locations l, visits v, samples s
WHERE l.id = v.locnid
AND v.id = s.visitid
GROUP BY l.locntype;

If we now organise the tables using natural keys (note that the location
id is still pretty artificial as it has to map back to something in the
real world):

CREATE TABLE locations (
id TEXT NOT NULL PRIMARY KEY,
locntype TEXT
);
CREATE TABLE visits (
locnid TEXT NOT NULL REFERENCES locations,
visitnum INTEGER NOT NULL,
PRIMARY KEY (locnid,visitnum),
visitdate DATE
);
CREATE TABLE samples (
locnid TEXT NOT NULL,
visitnum INTEGER NOT NULL,
samplenum INTEGER NOT NULL,
PRIMARY KEY (locnid,visitnum,samplenum),
FOREIGN KEY (locnid,visitnum) REFERENCES visits,
barcode TEXT UNIQUE
);

This has a nice simplifying effect on the above query; I can directly
refer to the location for each sample:

SELECT l.locntype, COUNT(*) AS samples
FROM locations l, samples s
WHERE l.id = s.locnid
GROUP BY l.locntype;

Is that a reasonable example of the different ways of laying things out
in the two styles?

The main advantage of synthetic keys is that it introduces another
layer of abstraction into the database, potentially allowing greater
flexibility (of course this can also be a burden).

Natural keys allow the above simplification of queries and ease the
manual fix-up of the data should something go horribly wrong. It's
reasonably easy to fix things if you have a sample referencing a
non-existent visit with natural keys, but if you've got synthetic keys
you're probably going to have to dump the sample as well. Of course,
if you've got yourself into this state you've already failed, but some
recourse is nice.

In summary; I see various advantages to either approach, but I don't
see either as being fundamentally "better". You can express the same
constraints in either style (I'm sure someone will think of a good
counter example though), it's the context in which it's used that will
determine which is more suitable to the task at hand. Using one style
exclusively is almost certainly bad, but having a preference for one or
the other is probably good as it'll make the database as a whole more
cohesive and subsequently ease maintenance.

Comments?

Sam

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Shane Ambler 2007-11-18 17:55:23 Re: Compressed Backup too big
Previous Message Tom Lane 2007-11-18 17:49:55 Re: HINT: Please REINDEX it.