Re: Accent insensitive search

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "PFC" <lists(at)peufeu(dot)com>
Cc: Diego Manilla Suárez <diego(dot)manilla(at)xeridia(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Accent insensitive search
Date: 2007-06-21 11:05:10
Message-ID: 877ipx1sbd.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


"PFC" <lists(at)peufeu(dot)com> writes:

>> Hi. I have a few databases created with UNICODE encoding, and I would like to
>> be able to search with accent insensitivity. There's something in Oracle
>> (NLS_COMP, NLS_SORT) and SQL Server (don't remember) to do this, but I found
>> nothing in PostgreSQL, just the 'to_ascii' function, which AFAIK, doesn't
>> work with UNICODE.

Postgres supports localized collation orders but what it doesn't currently
support is having multiple collation orders within the same server.

So when you initialize the server with initdb it takes the setting of
LC_COLLATE (usually from LC_ALL) and stores that permanently. If you initdb
with LC_COLLATE set to a locale like en_GB.UTF-8 or something like that you
may find one that has the behaviour you want. I think they won't be entirely
accent insensitive but they'll consider accents only if the rest of the string
is identical.

You can test the sort order of a locale by writing a file with sample words
and sorting with something like:

LC_ALL=en_GB.UTF-8 sort /tmp/data

> The easiest way is to create an extra column which will hold a copy of
> your text, with all accents removed. You can also convert it to lowercase and
> remove apostrophes, punctuation etc. Said column is kept up to date with a
> trigger.

That's another alternative which is useful if you need multiple collations in
your database. This gives you control over which collation is used when and
exactly what the rules are. The downside is that you have to reinvent the
collation rules which the localized collations already provide.

You don't necessarily have to keep a column in your table with the normalized
strings. You can normalize "on-the-fly" using an expression index as long as
your function always returns the same data given the same inputs (and is
therefore marked "immutable").

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Gregory Stark 2007-06-21 11:35:06 Re: 8.2.3 PANIC with "corrupted item pointer"
Previous Message Naz Gassiep 2007-06-21 11:01:42 Aggregates