From: | "Christopher Murtagh" <christopher(dot)murtagh(at)gmail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | UTF-8 and stripping accents |
Date: | 2006-06-15 17:08:47 |
Message-ID: | 92fbb7920606151008x2838b627tab5f55bb6e7c56b4@mail.gmail.com |
Lists: | pgsql-general |
Greetings folks,
I'm trying to write a stored procedure that strips accents from UTF-8
encoded text. I saw a thread on this list discussing something very
similar to this on April 8th, and used it to start. However, I'm
getting odd behaviour.
My stored procedure:
CREATE OR REPLACE FUNCTION strip_accents(text) RETURNS text
AS '
use Unicode::Normalize;
use Encode;
my $string = NFD( decode( utf8 => $_[0] ) );
$string =~ s/\p{Mn}+//ogsm;
return NFC($string);
'
LANGUAGE plperlu;
I'm trying this on two different postgres dbs. One is pg 7.4.6, the
other is 8.1.4 and they both break in different ways.
On the 8.1.4:
test=# select strip_accents('This is Québec, français, noël, à la mode');
-[ RECORD 1 ]-+------------------------------------------
strip_accents | This is Qu�bec, fran�ais, no�l, � la mod
(not sure how this will arrive on the list, but basically all accented
characters are replaced with a cedilla)
and if I try a 'select strip_accents(column) from table;' in a UTF8
encoded database I get:
ERROR: error from Perl function: Cannot decode string with wide
characters at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm
line 166.
On the 7.4.6, I simply get the input without any changes for both the
direct input and for a column.
test=# select strip_accents('This is Québec, français, noël, à la mode');
-[ RECORD 1 ]-+------------------------------------------
strip_accents | This is Québec, français, noël, à la mod
Now, on both of these machines, I have the following simple perl script:
[chris(at)mafalda ~]$ cat strip_accents.pl
#!/usr/bin/perl
use Unicode::Normalize;
use Encode;
my $string = NFD( decode( utf8 => $ARGV[0] ) );
$string =~ s/\p{Mn}+//ogsm;
print NFC($string)."\n";
When executed, it behaves as expected:
[chris(at)mafalda ~]$ ./strip_accents.pl 'This is Québec, français, noël, à la mode'
This is Quebec, francais, noel, a la mode
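For readers following along, the three steps that both the function and the script rely on (decompose to NFD, drop nonspacing marks, recompose to NFC) can be sketched outside Perl. The Python below is only an illustration of the technique, not the code that runs in the database. It assumes the input is already a decoded Unicode string rather than raw UTF-8 bytes, and the function name is simply borrowed from the post.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining accents from an already-decoded Unicode string."""
    # Decompose: "é" becomes "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (general category Mn), the class \p{Mn} matches.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_accents("This is Québec, français, noël, à la mode"))
```

The sketch deliberately leaves out the byte-to-character decoding step, since that is the part that behaves differently between the command line and the two servers.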
So, I'm obviously doing something dumb/wrong with encodings, but I
can't for the life of me figure it out. I've tried setting client
encodings, verifying database encodings, etc., all to no avail. Is
there something obvious that I'm missing? Is there a better way to
achieve what I'm trying to do?
Thanks in advance for any insight.
Cheers,
Chris