Re: to_ascii, or some other form of magic transliteration

From: Mike Rylander <mrylander(at)gmail(dot)com>
To: Ben <bench(at)silentmedia(dot)com>, Postgresql-General <pgsql-general(at)postgresql(dot)org>
Subject: Re: to_ascii, or some other form of magic transliteration
Date: 2005-09-11 14:57:51
Message-ID: b918cf3d05091107571f8f0974@mail.gmail.com
Lists: pgsql-general

On 9/10/05, Ben <bench(at)silentmedia(dot)com> wrote:
> Hrm, I must be missing something, because I don't see how this will
> transliterate to ASCII?

If you want non-Western text to be Romanized, you can take a look at
Text::Unidecode(1). The chunk of Perl I sent before strips non-spacing
marks (accents, rings, umlauts and such). You may need to strip other
character classes if you've got Unicode punctuation codepoints in the
text to be searched.

For the example you gave, the process is to decompose the character
"å" into normalization form D, which yields "a" plus the Unicode
non-spacing mark for the ring, and then to remove the non-spacing mark
(the ring diacritic) with the regex s/\pM//sog. That leaves just the
ASCII "a", and the text can then be treated as pure ASCII, because it
no longer contains any Unicode codepoints with an ord() above 127. You
may want to look here(2) for an explanation and examples of Unicode
normalization forms.
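
A minimal sketch of that decompose-and-strip step, for illustration
only. The helper name strip_marks is made up here; Unicode::Normalize
is the same module used in the earlier snippet quoted below.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# strip_marks() is a hypothetical helper name, not part of any module.
sub strip_marks {
    my ($txt) = @_;
    $txt = NFD($txt);    # U+00E5 becomes "a" followed by U+030A (combining ring above)
    $txt =~ s/\pM//g;    # remove every codepoint in the Unicode Mark category
    return $txt;
}

# "sm\x{e5}" is "små", written with an escape so the source file
# encoding does not matter.
my $ascii = strip_marks("sm\x{e5}");
print "$ascii\n";
```

This only helps for characters that have a canonical decomposition.
"ø" (U+00F8), for instance, has none, so it passes through unchanged
and still needs a mapping table or similar.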

If you don't need that much functionality (handling arbitrary Unicode
text), and you're dealing strictly with the Latin1 subset of Unicode,
you can just create a mapping table or hash to transliterate down to
ASCII, as done here(3).
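
A rough sketch of the mapping-hash idea, under stated assumptions: the
hash below is deliberately incomplete, its entries are illustrative
choices rather than a copy of the table in (3), and the names
%LATIN1_MAP and latin1_to_ascii are invented for this example.

```perl
use strict;
use warnings;

# Illustrative, incomplete mapping (an assumption, not the table from (3)).
my %LATIN1_MAP = (
    "\x{e5}" => 'a',     # a with ring above
    "\x{e4}" => 'a',     # a with diaeresis
    "\x{f6}" => 'o',     # o with diaeresis
    "\x{f8}" => 'o',     # o with stroke
    "\x{e9}" => 'e',     # e with acute
    "\x{e6}" => 'ae',    # ae ligature
    "\x{df}" => 'ss',    # sharp s
);

# Hypothetical helper: replace each mapped Latin1 character with its
# ASCII stand-in; characters without an entry are left as they are.
sub latin1_to_ascii {
    my ($txt) = @_;
    $txt =~ s/([\x{80}-\x{ff}])/exists $LATIN1_MAP{$1} ? $LATIN1_MAP{$1} : $1/ge;
    return $txt;
}

print latin1_to_ascii("sm\x{e5}"), "\n";
```

Unlike mark stripping, a table like this can also cover characters
that do not decompose, and it can map one character to several ASCII
letters.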

1) http://cpan.uwinnipeg.ca/htdocs/Text-Unidecode/Text/Unidecode.html
2) http://www.unicode.org/unicode/reports/tr15/#Canonical_Composition_Examples
3) http://www.eprints.org/files/eprints2/eprints-2.2/defaultcfg/ArchiveTextIndexingConfig.pm

>
> On Sep 10, 2005, at 5:30 AM, Mike Rylander wrote:
>
> > On 9/9/05, Ben <bench(at)silentmedia(dot)com> wrote:
> >
> >> I'm working on a problem that I imagine others have had, which
> >> basically boils down to having nice unicode display text that users
> >> are going to want to search against without typing it correctly....
> >> e.g. let a search for "sma" match "små". It seems like the best way
> >> to do this is to find a magic unicode transliteration mapping
> >> function, and then save the ASCII transliterations for searching
> >> against.
> >>
> >>
> >
> > The simplest solution to this that I've found is to maintain a
> > separate column for an ASCII-ized version of your text. The
> > conversion can be done automatically using a trigger, and I have one
> > in PL/PerlU that I use. It basically boils down to:
> >
> > 1) transform unicode text to normal form D
> > 2) strip combining non-spacing marks
> >
> > In modern Perls that looks like:
> >
> > #--------------
> > use Unicode::Normalize;
> > my $txt = NFD(shift());
> > $txt =~ s/\pM//og;
> > return $txt;
> > #--------------
> >
> > Hope that helps!
> >
> >
>

--
Mike Rylander
mrylander(at)gmail(dot)com
GPLS -- PINES Development
Database Developer
http://open-ils.org
