FW: Surrogate keys (Was: enums)

From: "Jim Nasby" <jnasby(at)pervasive(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: FW: Surrogate keys (Was: enums)
Date: 2006-01-18 22:47:11
Message-ID: D1D2D51E3BE3FC4E98598248901F759402C890D0@ausmail2k4.aus.pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Ooops, fat-finger'd -hackers...

-----Original Message-----
Adding -hackers back to the list.

> From: Leandro Guimarães Faria Corcete Dutra
> Em Seg, 2006-01-16 às 12:52 -0600, Jim C. Nasby escreveu:
> > On Sat, Jan 14, 2006 at 07:28:21PM +0900, Michael Glaesemann wrote:
> > >
> > > For UPDATEs and INSERTs, the "proper" primary key also
> needs to be
> > > checked, but keys are used for more than just checking
> uniqueness:
> > > they're also often used in JOINs. Joining against a
> single integer
> > > I'd think it quite a different proposition (I'd think
> faster in terms
> > > of performance) than joining against, say, a text column or a
> > > composite key.
>
> How different is that?

Comparing two ints is much, much faster than comparing two text fields. For a small number of comparisons, it doesn't matter. When you're joining tables together, it's a different story.

> > a) the optimizer does a really poor job on multi-column
> index statistics
>
> Then it should eventually be fixed?

It's on the to-do, but it's not an easy nut to crack.

> > b) If each parent record will have many children, the space
> savings from
> > using a surrogate key can be quite large
>
> Only where the surrogate is significantly smaller than
> the natural?

#define significant

Here's a real-life example: the primary table for stats.distributed.net has about 120M rows. One field in that table (participant_id) links back to the participant table; it's an int. If instead we used participant_name and that averaged 8 characters in length, that would grow the main table by 1GB (8 chars takes 8 bytes instead of 4, plus there's the varlena header of 4 bytes). The machine that stats runs on has 4G of memory, so cutting 1G of wasted space out of that table helps quite a bit.

(In actuality, there isn't participant_name... participants are identified by email address (not a great idea, but I wasn't around when that was chosen). As you can imagine, email addresses are substantially longer than 4 bytes. When we normalized email out of that main table things got substantially faster. That was a number of years ago, so the table was probably 15-25% of it's current size, but it still made a huge difference.)

> > c) depending on how you view things, putting actual keys
> all over the
> > place is denormalized
>
> How come?

See my other reply... :)

> > Generally, I just use surrogate keys for everything unless
> performance
> > dictates something else.
>
> Shouldn't it be the other way round, for the user's sake?

Why should it? It's trivial to create views that abstract surrogate keys out, and if you really want to you can even make the views updatable. But here's two other things to consider:

In many cases you can't define a single field as a unique key. So you end up with having to add many extra keys to all your join clauses. Not very friendly, and prone to error.

Not every language has equal support for text comparisons (and in my experience, almost all real keys are mostly text).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Glen Parker 2006-01-18 23:11:54 Re: [HACKERS] No heap lookups on index
Previous Message Jim C. Nasby 2006-01-18 22:38:53 Re: No heap lookups on index