Re: Unicode normalization

From: Andreas Kalsch <andreaskalsch(at)gmx(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Unicode normalization
Date: 2009-09-16 19:35:02
Message-ID: 4AB13DE6.3040800@gmx.de
Lists: pgsql-general

No,

I need a solution that is as generic as possible. I use UTF-8 encoded
Unicode strings at all levels. This is what I have done so far:

1) Writing a separate Python command line script for testing - works as
expected:

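#!/usr/bin/python
# Python 2 script: strip diacritics from the first command-line argument.

import sys
import unicodedata

# Decode the UTF-8 byte string into a unicode object.
text = sys.argv[1].decode('UTF-8')
# NFKD splits each accented character into base character + combining marks.
text = unicodedata.normalize('NFKD', text)
# Drop the combining marks, keeping only the base characters.
text = ''.join(c for c in text if unicodedata.combining(c) == 0)
print text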

2) Transferring this to PL/Python:

CREATE OR REPLACE FUNCTION test (str text)
RETURNS text
AS $$
import unicodedata
return unicodedata.normalize('NFKD', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;

Problem: PL/Python throws an error where my command-line script works
correctly:

# select test('aÄÖÜ');

ERROR: plpython: function "test" could not create return value
DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
encode character u'\u0308' in position 2: ordinal not in range(128)

I use PG 8.3 and Python 2.5.2. How can I make PL/Python behave as it does
in a normal Python environment?
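
The error message can be reproduced outside the database, which may help narrow it down. The sketch below is Python 3 (not the Python 2.5.2 used above) and only illustrates the encoding step: after NFKD, the character at position 2 is the combining diaeresis U+0308, which the ASCII codec cannot encode. It assumes, without having verified it against the 8.3 source, that PL/Python converts a unicode return value with the default ASCII codec. If that is the case, returning an explicitly UTF-8 encoded byte string from the function would be the likely workaround.

```python
import unicodedata

# 'a' followed by A, O, U with diaeresis, written as escapes so the
# example does not depend on the source file encoding.
original = 'a\u00c4\u00d6\u00dc'

# NFKD turns each accented letter into base letter + U+0308.
decomposed = unicodedata.normalize('NFKD', original)

# Encoding with the ASCII codec fails at the first combining mark.
try:
    decomposed.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding explicitly as UTF-8 succeeds and round-trips.
utf8_bytes = decomposed.encode('UTF-8')
```

Here `ascii_error.start` gives the index of the offending character, which corresponds to the "position 2" in the DETAIL line above.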

In the end it should look like this:

CREATE TABLE t (
...
ts tsvector NOT NULL
);

INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
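
As a rough illustration of what the body of such a normalize() function might compute, here is a Python 3 sketch (the script earlier in this message is Python 2). NFKD plus removal of combining marks covers accented letters and ª, but ø and ƒ have no Unicode decomposition, so they would need a lookup table. The names `strip_diacritics` and `EXTRA_MAP` are hypothetical, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical fallback table for characters that NFKD leaves untouched.
EXTRA_MAP = {
    '\u00f8': 'o',  # o with stroke
    '\u00d8': 'O',  # O with stroke
    '\u0192': 'f',  # f with hook
}


def strip_diacritics(text):
    """Remove combining marks after NFKD, then apply the fallback table."""
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only characters that are not combining marks.
    stripped = ''.join(c for c in decomposed if unicodedata.combining(c) == 0)
    # Map the remaining characters that have no decomposition.
    return ''.join(EXTRA_MAP.get(c, c) for c in stripped)
```

A table-driven fallback keeps the generic NFKD step intact and confines the language-specific choices to one place that can be extended as needed.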

Andi

David Fetter wrote:
> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>
>> Has somebody integrated Unicode normalization into Postgres? If not, I
>> would have to implement my own function by using this CPAN module:
>> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>
>> I need a function which removes all diacritics (1) and transforms some
>> characters to a more compatible form (2) to get a better index on
>> strings.
>>
>> Best,
>>
>> Andi
>>
>>
>> 1) à,ä, ... => a
>> 2) ø => o, ƒ => f, ª => a
>>
>
> You mean something like this?
>
> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>
> Cheers,
> David.
>
