From: | "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: FW: Surrogate keys (Was: enums) |
Date: | 2006-01-18 23:22:48 |
Message-ID: | 20060118232248.GB17896@pervasive.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
>> Comparing two ints is much, much faster than comparing two text
>> fields. For a small number of comparisons, it doesn't matter. When
>> you're joining tables together, it's a different story.
>
> That is where data independence would come handy... like a better enum,
>with possreps and hidden implementation.
>
> Forgive me my ignorance, but are ints inherently faster to compare than
>strings, or is it just an implementation detail? Ideally, if this is so
>a fully data-independent system would create a hash behind the back of
>user in order to get performance.
The CPU can do an integer comparison with one instruction; it can't do
that with a text string. (Well, theoretically if the string was 3/4
bytes exactly (dependin on null termination) you could compare just as
fast, but I'm pretty certain that no compiler is that fancy.)
>> Here's a real-life example: the primary table for
>> stats.distributed.net has about 120M rows. One field in that table
>> (participant_id) links back to the participant table; it's an int. If
>> instead we used participant_name and that averaged 8 characters in
>> length, that would grow the main table by 1GB (8 chars takes 8 bytes
>> instead of 4, plus there's the varlena header of 4 bytes). The machine
>> that stats runs on has 4G of memory, so cutting 1G of wasted space out
>> of that table helps quite a bit.
>
> OK, hardly a typical example. As I think I left clear, my problem is
>not using surrogate keys, but using them by default, or even
>exclusively.
No? It's certainly not uncommon to have tables with 100M+ rows. And keep
in mind that this applies to every row of every table that has foreign
keys. I'd bet it's actually common to save 1G or more with surrogate
keys in moderately sized databases.
Of course, you do have to be intelligent here, too. The only key defined
on the table in my example is participant_id, project_id, date; there is
no surrogate key because there's no real reason to have one.
>> (In actuality, there isn't participant_name... participants are
>> identified by email address (not a great idea, but I wasn't around
>> when that was chosen). As you can imagine, email addresses are
>> substantially longer than 4 bytes. When we normalized email out of
>> that main table things got substantially faster. That was a number of
>> years ago, so the table was probably 15-25% of it's current size, but
>> it still made a huge difference.)
>
> This isn't normalisation at all, as far as I understand it. It is just
I don't have the rules of normalization memorized enough to know what
form this breaks, but I'm 99% certain it breaks at least one of them.
Look at it this way: if someone wants to change their email address,
best case scenario is that you have cascading RI setup and it updates
thousands of rows in that table. Worst case scenario, you just de-linked
a whole bunch of data. But with a surrogate key, all you have to do is
update one row in one table and you're done.
>that we don't have data independence... so you had to expose an
>implementation detail?
Expose to what? The application? First, this is a pretty minor thing to
expose; second, if it's that big a concern you can completely hide it by
using a view.
But the reality is, dealing with a numeric ID can be a heck of a lot
easier than an email address. Look at URLs that embbed one versus the
other for a good example.
>> Why should it? It's trivial to create views that abstract surrogate
>> keys out, and if you really want to you can even make the views
>> updatable. But here's two other things to consider:
>
> These views, in heavy querying environments, can be prohibitive.
"Normalize 'til it hurts; denormalize 'til it works."
Yes, the added overhead of rules for updates/inserts/deletes could start
to add up in performance-critical code. But if performance is that
critical you're far more likely to run into other bottlenecks first. And
worst-case, you abstract behind a stored procedure that just has the
right queries hard-coded.
As for select-only views you'll have a hard time showing any meaningful
performance penalty.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
From | Date | Subject | |
---|---|---|---|
Next Message | Bruce Momjian | 2006-01-18 23:25:29 | Re: pgxs/windows |
Previous Message | Glen Parker | 2006-01-18 23:11:54 | Re: [HACKERS] No heap lookups on index |