SOLVED: Re: UTF-8 and stripping accents

From:	"Christopher Murtagh" <christopher(dot)murtagh(at)gmail(dot)com>
To:	pgsql-general(at)postgresql(dot)org
Cc:	"Mike Rylander" <mrylander(at)gmail(dot)com>
Subject:	SOLVED: Re: UTF-8 and stripping accents
Date:	2006-06-15 19:22:09
Message-ID:	92fbb7920606151222w44b9c604pe6853d7b06e03b66@mail.gmail.com
Lists:	pgsql-general

Hey, I solved my own problem! I'm posting here because while I was
looking for solutions, I found tons of folks tackling the same
problem. Most either never found a solution or had to resort to
cumbersome translate() calls to get what they wanted.

The difference between my 7.4.6 and 8.1.4 DBs was that 7.4.6 had
UNICODE as its encoding, whereas the 8.1.4 was UTF8. So, the 7.4.6
needs the decode() call and the 8.1.4 doesn't.

Also, I had to escape the '\' in the regex (writing it as '\\'),
because the function body is a single-quoted string.

So, for the record, to strip out all accents from UTF8 encoded text:

CREATE OR REPLACE FUNCTION strip_accents(text) RETURNS text
AS '
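use Unicode::Normalize;
use Encode;

# Decompose: each accented character becomes base letter + combining marks
my $string = NFD($_[0]);
# Strip the combining marks (Unicode category Mn); backslash doubled for SQL quoting
$string =~ s/\\p{Mn}//ogsm;
# Recompose what is left
return NFC($string);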
'
LANGUAGE plperlu;
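
On the backslash escaping mentioned above: a possible alternative on the
8.1.4 side is dollar quoting (available from PostgreSQL 8.0), which passes
the function body through untouched, so the regex needs only a single
backslash. The following is a sketch under that assumption, not something
tested in this thread. The name strip_accents_dq is hypothetical, picked
only so it does not replace the function above. It keeps only the g flag,
since o, s and m have no effect on this particular pattern, and it does not
load Encode because nothing in this variant calls it.

```sql
-- Sketch only: assumes PostgreSQL 8.0+ (dollar quoting) and a UTF8 database.
-- strip_accents_dq is a hypothetical name used for illustration.
CREATE OR REPLACE FUNCTION strip_accents_dq(text) RETURNS text
AS $$
use Unicode::Normalize;

# Decompose each accented character into base letter plus combining marks.
my $string = NFD($_[0]);
# Remove the combining marks (Unicode category Mn, nonspacing mark).
# No doubled backslash is needed inside a dollar-quoted body.
$string =~ s/\p{Mn}//g;
# Recompose whatever remains.
return NFC($string);
$$
LANGUAGE plperlu;
```

Dollar quoting is not available on 7.4.6, so the single-quoted form with the
doubled backslash is still needed there.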

For the 7.4.6 DB whose encoding was UNICODE, a slight difference:

CREATE OR REPLACE FUNCTION strip_accents(text) RETURNS text
AS '
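use Unicode::Normalize;
use Encode;

# The 7.4.6 side needs the input decoded from UTF-8 before normalizing
my $string = NFD(decode( utf8 => $_[0]));
# Strip the combining marks (Unicode category Mn); backslash doubled for SQL quoting
$string =~ s/\\p{Mn}//ogsm;
# Recompose what is left
return NFC($string);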
'
LANGUAGE plperlu;
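
For anyone wanting to try it, a quick check might look like the following.
The sample string and the table and column names (people, last_name) are
made up for illustration, and this assumes strip_accents() was created as
above and that the client encoding matches the database encoding.

```sql
-- Hypothetical sample input; each accented letter should come back unaccented.
SELECT strip_accents('Crème Brûlée à la carte');

-- Hypothetical table and column: accent-insensitive, case-insensitive match.
SELECT last_name
  FROM people
 WHERE lower(strip_accents(last_name)) = lower(strip_accents('Müller'));
```

With a working function the first query should return
'Creme Brulee a la carte'. Note that only characters with a canonical
decomposition are affected: letters such as 'ø' or 'ß' have no separate
combining mark to strip and pass through unchanged.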

I hope this is of some use to other folks here. Thanks to Mike
Rylander for the initial code.

Cheers,

Chris
