| From: | Bruce Momjian <bruce(at)momjian(dot)us> | 
|---|---|
| To: | Jean-Baptiste Quenot <jbq(at)caraldi(dot)com> | 
| Cc: | pgsql-bugs(at)postgresql(dot)org | 
| Subject: | Re: BUG #4200: Regexp character classes not UTF8-compliant | 
| Date: | 2008-05-29 00:04:14 | 
| Message-ID: | 200805290004.m4T04Eb15568@momjian.us | 
| Lists: | pgsql-bugs | 
I am not sure how to help you, except to say that UTF8 is only a character
set encoding, while en_US.UTF-8 is a locale that also names an encoding.  My
guess is that if you use a *.UTF-8 locale for the proper localization
language, it would work.
http://www.postgresql.org/docs/8.2/static/locale.html
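
To illustrate the distinction above outside PostgreSQL, here is a minimal
Python sketch. It is only an analogy and does not use PostgreSQL's regex
engine. The same Unicode string is stripped of "word" characters twice, once
with ASCII-only classification and once with Unicode-aware classification.
The sample string and the helper name strip_word_chars are hypothetical and
are not taken from the report.

```python
import re

# Hypothetical sample text containing non-ASCII letters ("bébé").
SAMPLE = "b\u00e9b\u00e9"


def strip_word_chars(text, ascii_only):
    """Remove every character that the \\w class matches.

    ascii_only=True limits \\w to [a-zA-Z0-9_], which resembles a
    classification that only knows the ASCII range.
    ascii_only=False uses Python's Unicode-aware classification.
    """
    flags = re.ASCII if ascii_only else 0
    return re.sub(r"\w", "", text, flags=flags)


# The accented letters survive when classification is ASCII-only.
ascii_result = strip_word_chars(SAMPLE, ascii_only=True)

# Every letter is removed when classification is Unicode-aware.
unicode_result = strip_word_chars(SAMPLE, ascii_only=False)

print(repr(ascii_result))
print(repr(unicode_result))
```

The input is identical in both calls; only the character-classification rules
change, which is the role the locale's ctype setting plays alongside the
encoding.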
---------------------------------------------------------------------------
Jean-Baptiste Quenot wrote:
> 
> The following bug has been logged online:
> 
> Bug reference:      4200
> Logged by:          Jean-Baptiste Quenot
> Email address:      jbq(at)caraldi(dot)com
> PostgreSQL version: 8.3.1
> Operating system:   Linux Ubuntu Hardy
> Description:        Regexp character classes not UTF8-compliant
> Details: 
> 
> PostgreSQL documentation at
> http://www.postgresql.org/docs/8.3/static/functions-matching.html describes
> the various character classes, and they can be used to match or replace
> strings with regexp support.  However, the [:alnum:] and [:alpha:] character
> classes are not UTF8-compliant, as shown in the examples below:
> 
> dockee=# show client_encoding;
>  client_encoding 
> -----------------
>  UTF8
> (1 row)
> 
> dockee=# show lc_ctype;
>   lc_ctype   
> -------------
>  en_US.UTF-8
> (1 row)
> 
> dockee=# select regexp_replace('bbu', '[[:alnum:]]', '', 'g');
>  regexp_replace 
> ----------------
>  
> (1 row)
> 
> ovhdev=# select regexp_replace('bbu', '[[:alpha:]]', '', 'g');
>  regexp_replace 
> ----------------
>  
> (1 row)
> 
> dockee=# select regexp_replace('bbu', $$\w$$, '', 'g');
>  regexp_replace 
> ----------------
>  
> (1 row)
> 
> Only characters in the ASCII range were correctly detected to belong to the
> [:alnum:] character class, whereas other characters are valid too.
> 
-- 
  Bruce Momjian  <bruce(at)momjian(dot)us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +