From: | Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> |
---|---|
To: | Sam Mason <sam(at)samason(dot)me(dot)uk> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Best practices for moving UTF8 databases |
Date: | 2009-07-22 09:26:37 |
Message-ID: | e373d31e0907220226h6bd5c33ag5ac2c63fa0be241f@mail.gmail.com |
Lists: | pgsql-general |
On Tue, Jul 21, 2009 at 6:35 PM, Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:
> On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
>> >I'd love to fix them. But if I do a search for
>> >SELECT * FROM xyz WHERE col like '%0x80%'
>> >
>> >it doesn't work. How should I search for these characters?
>>
>> In 8.2, try: WHERE strpos(col, E'\x80') > 0
>>
>> Note that this may find valid data as well, because the error you get
>> is when 0x80 is the first byte of a character in UTF8; when it's at
>> another position, you don't want to change it.
>
> There are various regexs around to check for valid UTF-8 encoding; one
> appears to be:
>
> http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
>
> One translation into PG would be:
>
> WHERE NOT col ~ ( '^('||
> $$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII
> $$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte
> $$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs
> $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
> $$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates
> $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3
> $$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15
> $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16
> '*)$' );
>
> This seems to do the right thing for me in an SQL_ASCII database.
>
I tried this, but I get an error:
mypg=# select * from interesting WHERE NOT description ~ ( '^('||
mypg(# $$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII
mypg(# $$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte
mypg(# $$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs
mypg(# $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
mypg(# $$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates
mypg(# $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3
mypg(# $$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15
mypg(# $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16
mypg(# '*)$' )
mypg-#
mypg-# ;
ERROR: invalid regular expression: quantifier operand invalid
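
A possible reading of this error, offered only as a guess: once the pieces are concatenated, the pattern ends in `[\x80-\xBF]{2}*)$`, so the final `*` sits inside the parentheses and is applied to the `{2}` quantifier rather than to the whole group. The sketch below runs the same byte-level check outside the database, in Python. It assumes that the closing `'*)$'` was meant to be `')*$'` and that `\09\0A\0D` was meant to be `\x09\x0A\x0D`; neither assumption is confirmed in this thread, and the names `VALID_UTF8` and `is_valid_utf8` are made up for illustration.

```python
import re

# Byte-level pattern for well-formed UTF-8, adapted from the regex quoted above.
# Assumptions (not confirmed in the thread):
#   - \09\0A\0D is read as \x09\x0A\x0D (tab, line feed, carriage return)
#   - the closing '*)$' is read as ')*$', so the star repeats the whole group
# Like the quoted regex, this rejects control bytes other than tab, LF and CR.
VALID_UTF8 = re.compile(
    rb"(?:"
    rb"[\x09\x0A\x0D\x20-\x7E]"              # ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"              # non-overlong 2-byte
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # straight 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # planes 1-3
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # planes 4-15
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # plane 16
    rb")*"
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte of data belongs to one of the sequences above."""
    return VALID_UTF8.fullmatch(data) is not None


if __name__ == "__main__":
    samples = [b"plain text", "caf\u00e9".encode("utf-8"), b"stray \x80 byte"]
    for raw in samples:
        print(raw, is_valid_utf8(raw))
```

The check works on raw bytes rather than decoded text, which matches the SQL_ASCII setting mentioned above, where the column content is treated as uninterpreted bytes.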