Quick Links

Re: Searching for "bare" letters

From:	Eduardo Morras <nec556(at)retena(dot)com>
To:	"Reuven M(dot) Lerner" <reuven(at)lerner(dot)co(dot)il>,pgsql-general(at)postgresql(dot)org
Subject:	Re: Searching for "bare" letters
Date:	2011-10-02 21:03:18
Message-ID:	4E5EF2B1008AF013@
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

At 01:25 02/10/2011, Reuven M. Lerner wrote:

>Hi, everyone.Â I'm working on a project on
>PostgreSQL 9.0 (soon to be upgraded to 9.1,
>given that we haven't yet launched).Â The
>project will involve numerous text fields
>containing English, Spanish, and
>Portuguese.Â Some of those text fields will be
>searchable by the user.Â That's easy enough to
>do; for our purposes, I was planning to use some
>combination of LIKE searches; the database is
>small enough that this doesn't take very much
>time, and we don't expect the number of
>searchable records (or columns within those records) to be all that large.
>
>The thing is, the people running the site want
>searches to work on what I'm calling (for lack
>of a better term) "bare" letters.Â That is, if
>the user searches for "n", then the search
>should also match Spanish words containing
>"Ã±".Â I'm told by Spanish-speaking members of
>the team that this is how they would expect
>searches to work.Â However, when I just did a
>quick test using a UTF-8 encoded 9.0 database, I
>found that PostgreSQL didn'tÂ see the two
>characters as identical.Â (I must say, this is
>the behavior that I would have expected, had the
>Spanish-speaking team member not said anything on the subject.)
>
>So my question is whether I can somehow wrangle
>PostgreSQL into thinking that "n" and "Ã±" are
>the same character for search purposes, or if I
>need to do something else -- use regexps, keep a
>"naked," searchable version of each column
>alongside the native one, or something else entirely -- to get this to work.
>
>Any ideas?

You can use perceptual hashing for that. There
are multiple algorithms, some of them can be tuned for specific languages.

See this documentation:

http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description,

http://en.wikipedia.org/wiki/Soundex is the first one developed, very old,

http://en.wikipedia.org/wiki/Metaphone is a
family of several modern algorithms.

Remember that they are hashing algorithms, some
words can collide because they have the same pronunciation but write different.

I remember that datapark search engine uses them
with dictionaries. You can check it too.

http://www.dataparksearch.org/

>Thanks,
>
>Reuven

HTH

In response to

Searching for "bare" letters at 2011-10-01 23:25:03 from Reuven M. Lerner

Browse pgsql-general by date

	From	Date	Subject
Next Message	Peter Eisentraut	2011-10-03 04:12:25	Re: pg_upgrade 8.4 -> 9.1 failures
Previous Message	Joseph S	2011-10-02 19:45:33	pg_upgrade 8.4 -> 9.1 failures