Re: Best practices for moving UTF8 databases

From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Best practices for moving UTF8 databases
Date: 2009-07-21 10:35:47
Message-ID: 20090721103547.GN5407@samason.me.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
> >I'd love to fix them. But if I do a search for
> >SELECT * FROM xyz WHERE col like '%0x80%'
> >
> >it doesn't work. How should I search for these characters?
>
> In 8.2, try: WHERE strpos(col, E'\x80') > 0
>
> Note that this may find valid data as well, because the error you get
> is when 0x80 is the first byte of a character in UTF8; when it's at
> another position, you don't want to change it.

There are various regexs around to check for valid UTF-8 encoding; one
appears to be:

http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

One translation into PG would be:

WHERE NOT col ~ ( '^('||
$$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII
$$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte
$$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs
$$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
$$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates
$$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3
$$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15
$$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16
'*)$' );

This seems to do the right thing for me in an SQL_ASCII database.

--
Sam http://samason.me.uk/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message tanjunhua 2009-07-21 10:39:00 ***UNCHECKED*** Re: memory leak occur when disconnect database
Previous Message tanjunhua 2009-07-21 10:28:19 Re: memory leak occur when disconnect database