From: | Andreas Kalsch <andreaskalsch(at)gmx(dot)de> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | How to simplify unicode strings |
Date: | 2009-09-16 23:37:47 |
Message-ID: | 4AB176CB.9080004@gmx.de |
Lists: | pgsql-general |
Thank you Sam,
this led to the correct solution:
CREATE OR REPLACE FUNCTION simplify (str text)
RETURNS text
AS $$
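import unicodedata
# Decompose to NFKD so that accents become separate combining characters
s = unicodedata.normalize('NFKD', str.decode('UTF-8'))
# Drop the combining characters and keep only the base characters
s = ''.join(c for c in s if unicodedata.combining(c) == 0)
return s.encode('UTF-8')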
$$ LANGUAGE plpythonu;
test=# select simplify('Français va à Paris, () {} [] µ @ º Ångstrøm
Phiat-im hû-hō sī phiat tī 1-ê ki-chhó· jī-bó bīn-téng ê hû-hō. Siōng
phó·-phiàn ê kong-lêng sī kái-piàn ki-chhó· jī-bó ê hoat-im.');
simplify
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Francais va a Paris, () {} [] μ @ o Angstrøm Phiat-im hu-ho si phiat ti
1-e ki-chho· ji-bo bin-teng e hu-ho. Siong pho·-phian e kong-leng si
kai-pian ki-chho· ji-bo e hoat-im.
(1 row)
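For anyone who wants to try the same logic outside the database, here is a minimal standalone sketch. It assumes Python 3, where strings are already Unicode, so the decode/encode steps from the PL/Python version are not needed. It is only meant for experimenting, not as a replacement for the function above.

```python
import unicodedata


def simplify(text: str) -> str:
    """Strip combining marks after NFKD decomposition (assumes Python 3 str)."""
    # NFKD splits characters such as 'ç' into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters whose canonical combining class is 0 (non-combining)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    print(simplify("Français va à Paris, Ångstrøm"))
```

Characters without a decomposition, such as "ø", pass through unchanged, which matches the output shown above.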
One question remains: How is the performance of PL/Python?
When there are syntax errors in the Python code, they are not reported
on CREATE, because the function seems to be recompiled on every call.
This leads to the next question: When will the Unicode stuff be included in
the main distribution?
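On the syntax error point, one possible workaround is to check the function body with plain Python before running CREATE. This is only a sketch: check_body is a made-up helper name, and wrapping the body in a def is my assumption about how the body is treated, not something taken from the PL/Python source. It also only checks against the syntax of whichever Python you run it with.

```python
import textwrap


def check_body(body: str, arg_names=("str",)) -> None:
    """Raise SyntaxError if `body` does not compile as a function body.

    Hypothetical helper: the def wrapper is an assumption made so that a
    top-level `return` in the body is valid syntax.
    """
    header = "def _body_check({}):\n".format(", ".join(arg_names))
    source = header + textwrap.indent(body, "    ")
    # compile() parses the source without executing any of it
    compile(source, "<function body>", "exec")


if __name__ == "__main__":
    check_body("import unicodedata\nreturn unicodedata.normalize('NFKD', str)")
    print("body compiles")
```

Because nothing is executed, this catches syntax errors only; a wrong name or a missing module still shows up at call time.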
Andi
Sam Mason wrote:
> On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote:
>
>> CREATE OR REPLACE FUNCTION test (str text)
>> RETURNS text
>> AS $$
>> import unicodedata
>> return unicodedata.normalize('NFKD', str.decode('UTF-8'))
>> $$ LANGUAGE plpythonu;
>>
>
> I'd guess you want that to be:
>
> return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8');
>
> If you're converting from a utf8 encoding, you probably need to go
> back again! This could certainly be made easier though, PG knows what
> encoding its strings are stored in, why doesn't it work with unicode
> strings by default?
>
>