| From: | Sam Mason <sam(at)samason(dot)me(dot)uk> | 
|---|---|
| To: | pgsql-general(at)postgresql(dot)org | 
| Subject: | Re: Best practices for moving UTF8 databases | 
| Date: | 2009-07-21 10:35:47 | 
| Message-ID: | 20090721103547.GN5407@samason.me.uk | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-general | 
On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
> >I'd love to fix them. But if I do a search for
> >SELECT * FROM xyz WHERE col like '%0x80%'
> >
> >it doesn't work. How should I search for these characters?
> 
> In 8.2, try: WHERE strpos(col, E'\x80') > 0
> 
> Note that this may find valid data as well, because the error you get 
> is when 0x80 is the first byte of a character in UTF8; when it's at 
> another position, you don't want to change it.
There are various regexs around to check for valid UTF-8 encoding; one
appears to be:
http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
One translation into PG would be:
  WHERE NOT col ~ ( '^('||
    $$[\09\0A\0D\x20-\x7E]|$$||               -- ASCII
    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
   '*)$' );
This seems to do the right thing for me in an SQL_ASCII database.
-- 
  Sam  http://samason.me.uk/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | tanjunhua | 2009-07-21 10:39:00 | ***UNCHECKED*** Re: memory leak occur when disconnect database | 
| Previous Message | tanjunhua | 2009-07-21 10:28:19 | Re: memory leak occur when disconnect database |