Re: Case Insensitive Data Type

From: "Peter Darley" <pdarley(at)kinesis-cem(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Russell Black" <russell(dot)black(at)iarchives(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Case Insensitive Data Type
Date: 2002-05-24 22:16:17
Message-ID: NNEAICKPNOGDBHNCEDCPIEBCCJAA.pdarley@kinesis-cem.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Friends,
I don't want to imply something about things I don't know anything about,
but I'm interested in why nobody has suggested creating a new data type
using CREATE TYPE, setting up it's various functions and creating new
operators for it. It looks (reading the docs) like it would do what Russel
wants without too much work. Is there some reason that folks avoid this
approach?
Thanks,
Peter Darley

-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org
[mailto:pgsql-general-owner(at)postgresql(dot)org]On Behalf Of Tom Lane
Sent: Friday, May 24, 2002 11:48 AM
To: Russell Black
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] Case Insensitive Data Type

"Russell Black" <russell(dot)black(at)iarchives(dot)com> writes:
> I'm using email addresses as a primary key in one of my tables.
Currently,=
> I have to ensure that the email addresses are converted to lower case
befo=
> re they get put into the table, and that all lookups on that field are
conv=
> erted to lower case before the select statement, in order to ensure that
Jo=
> e(at)Somewhere(dot)com is the same as joe(at)somewhere(dot)com(dot)

A partial answer to this is

CREATE UNIQUE INDEX foo_key on foo (lower(email));

which will prevent duplicate entries with the same-up-to-case-folding
address. However you'd still have to query with queries like

SELECT * FROM foo WHERE lower(email) = lower('probe value');

if the probe values you are handed can't be relied on to be downcased
already. (I tried to think of a way to insert the lower() automatically
using a rule, but couldn't come up with anything --- ideas anyone?)

> Does anyone know of a case-insensitive data type?

I do not think you should go away feeling that a case-insensitive data
type would magically solve your problems. Today you might think that
case-folding is what you want, but by no later than next week you would
figure out that there are other sorts of normalizations you'll also need
to do on provided addresses --- one obvious example is stripping leading
and trailing blanks and reducing consecutive blanks to a single blank.
In fact if you really want to look like you know what you're doing,
you'll have to strip out the comment portions of an address entirely.
For example, these are all equivalent forms per RFC specs:
joe(at)blow(dot)com
Joe Blow <joe(at)blow(dot)com>
"Joe Blow" <joe(at)blow(dot)com>
joe(at)blow(dot)com (Joe Blow)
and should be recognized as such by anything that pretends to know
what email addresses are. (For that matter, you do know that the
specs say only the hostname part should be assumed case-insensitive,
don't you?)

So the real bottom line here is that you'd better figure a way to pass
the input strings through an address-normalization function. "lower()"
might do as a first-order approximation but you won't stay with it
forever.

> Can I create a custom data type to do this?

You probably could, but it'd likely be more work than you want. A
minimal implementation would require an input function (which would
apply the normalization rules you want), an output function (for which
you could probably just commandeer textout), and a no-op text() coercion
function (assuming you'd like to do anything besides display the value).
Then you'd have to create a set of index operators and access method
entries, if you intended to make this indexable --- these could all
piggyback on text operators, but you'd still need to do the work of
making the catalog entries for them.

Assuming you did have all that, I think a query like

select * from foo where email = 'Joe Blow <joe(at)blow(dot)com>'

would indeed work the way you want --- the initially untyped string
literal would eventually get coerced to your datatype, and then your
input conversion routine could do the right things to it. But it'll
likely be easier to make it happen on the client side ;-)

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2002-05-24 22:24:58 Re: How to add big amounts of data into database?
Previous Message Andrew Sullivan 2002-05-24 22:08:35 Re: Case Insensitive Data Type