Re: Unicode normalization

From: Andreas Kalsch <aka(at)aka-fotos(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Unicode normalization
Date: 2009-09-16 19:40:45
Message-ID: 4AB13F3D.20202@aka-fotos.de
Lists: pgsql-general

Update: The error is, of course, that the function's return value is
converted to "str" instead of staying unicode, and that conversion uses
the 'ascii' codec. It is not str.decode('UTF-8') that causes the error.
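
A minimal sketch of the workaround this suggests: encode the normalized
text back to UTF-8 explicitly, so that nothing is left for an implicit
'ascii' conversion. This is plain Python 3 run outside PL/Python; the
function name normalize_utf8 is made up for illustration, and it is an
assumption (not tested on PG 8.3) that PL/Python passes an already
encoded byte string through unchanged.

```python
import unicodedata


def normalize_utf8(raw):
    """Take UTF-8 encoded bytes and return NFKD-normalized UTF-8 bytes.

    Hypothetical helper: the name and the bytes-in/bytes-out contract are
    assumptions for illustration, not part of PL/Python.
    """
    text = raw.decode('UTF-8')                      # bytes -> unicode
    normalized = unicodedata.normalize('NFKD', text)
    return normalized.encode('UTF-8')               # unicode -> bytes, explicitly


# The test string from the failing query, written with escapes:
# 'a', A-umlaut, O-umlaut, U-umlaut.
sample = u'a\u00c4\u00d6\u00dc'
decomposed = unicodedata.normalize('NFKD', sample)

# After NFKD the combining diaeresis U+0308 sits at index 2, which is the
# position named in the error message.
print(repr(decomposed[2]))

# Explicit UTF-8 encoding succeeds where the 'ascii' codec cannot.
print(normalize_utf8(sample.encode('UTF-8')))
```

In the PL/Python function this would presumably amount to appending
.encode('UTF-8') to the returned expression.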

Andreas Kalsch wrote:
> No,
>
> I need a solution which is as generic as possible. I use UTF-8 encoded
> unicode strings on all levels. This is what I have done so far:
>
>
> 1) Writing a separate Python command line script for testing - works
> as expected:
>
> #!/usr/bin/python
>
> import sys
> import unicodedata
>
> str = sys.argv[1].decode('UTF-8')
> str = unicodedata.normalize('NFKD', str)
> str = ''.join(c for c in str if unicodedata.combining(c) == 0)
> print str
>
>
> 2) Transferring this to PL/Python:
>
> CREATE OR REPLACE FUNCTION test (str text)
> RETURNS text
> AS $$
> import unicodedata
> return unicodedata.normalize('NFKD', str.decode('UTF-8'))
> $$ LANGUAGE plpythonu;
>
> Problem: plpython throws an error where my command-line script works
> correctly:
>
> # select test('aÄÖÜ');
>
> ERROR: plpython: function "test" could not create return value
> DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
> encode character u'\u0308' in position 2: ordinal not in range(128)
>
>
>
> I use PG 8.3 and Python 2.5.2. How can I make plpython behave as it
> does in a normal Python environment?
>
>
> In the end it should look like this:
>
> CREATE TABLE t (
> ...
> ts tsvector NOT NULL
> );
>
> INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
>
> Andi
>
>
> David Fetter wrote:
>> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>>
>>> Has somebody integrated Unicode normalization into Postgres? If not,
>>> I would have to implement my own function by using this CPAN
>>> module: http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>>
>>> I need a function which removes all diacritics (1) and transforms
>>> some characters to a more compatible form (2) to get a better index
>>> on strings.
>>>
>>> Best,
>>>
>>> Andi
>>>
>>>
>>> 1) à,ä, ... => a
>>> 2) ø => o, ƒ => f, ª => a
>>>
>>
>> You mean something like this?
>>
>> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>>
>>
>> Cheers,
>> David.
>>
>
>
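
For reference, the two requirements from the original question (strip
diacritics, and map some characters to a more compatible form) can be
sketched together in plain Python 3. NFKD plus dropping combining marks
covers à, ä and ª, but ø and ƒ have no Unicode decomposition, so they need
an explicit table. The FALLBACK table and the name strip_diacritics are
assumptions for illustration only, and the table is deliberately not
complete.

```python
import unicodedata

# Hypothetical fallback table for characters that NFKD leaves untouched
# because they have no decomposition mapping. Extend as needed.
FALLBACK = {
    u'\u00f8': u'o',  # o with stroke
    u'\u00d8': u'O',  # O with stroke
    u'\u0192': u'f',  # f with hook
}


def strip_diacritics(text):
    """Return text with combining marks removed and fallbacks applied."""
    # Step 1: compatibility decomposition splits base letters from their
    # combining marks and maps forms such as the feminine ordinal to 'a'.
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 2: drop every combining character (combining class != 0).
    stripped = u''.join(c for c in decomposed if unicodedata.combining(c) == 0)
    # Step 3: replace the remaining characters that have no decomposition.
    return u''.join(FALLBACK.get(c, c) for c in stripped)


print(strip_diacritics(u'\u00e0\u00e4'))        # a-grave, a-umlaut
print(strip_diacritics(u'\u00f8\u0192\u00aa'))  # o-stroke, f-hook, ordinal a
```

Steps 1 and 2 are the same approach as the command-line script quoted
above; only step 3 is new.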
