BUG #7999: Regexp with utf8

From: somloieater(at)gmail(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #7999: Regexp with utf8
Date: 2013-03-27 10:32:57
Message-ID: E1UKnf7-0005Sa-L4@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 7999
Logged by: david
Email address: somloieater(at)gmail(dot)com
PostgreSQL version: 9.1.8
Operating system: linux
Description:

\y and \Y do not behave correctly next to
multibyte UTF-8 characters - they seem to invert their senses:

Proper behaviour with ASCII e
'es'~$$\y[eɛ]s$$ => t
Inverted behaviour with epsilon
'ɛs'~$$\y[eɛ]s$$ => f
'ɛs'~$$[eɛ]\ys$$ => t
'ɛs'~$$[eɛ]\Ys$$ => f

This seems to be a case of UTF-8 characters not being recognised as
word-forming:

'ɛ'~$$\w$$ => f

I've checked with a few other characters which are more than one byte in
UTF-8. U+00F0 counts as \w, but nothing I've tried above U+00FF matches. I
wonder if it's something to do with code points of 256 and above?
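
To make the guess above easier to check, here is a small Python sketch (not PostgreSQL code) that prints the code point and UTF-8 byte length of the characters mentioned. It assumes the epsilon used in the examples is U+025B (LATIN SMALL LETTER OPEN E); the helper name `describe` is made up for illustration.

```python
# Characters from the report: ASCII e, U+00F0, and the epsilon.
# Assumption: the epsilon is U+025B (LATIN SMALL LETTER OPEN E).
chars = ["e", "\u00f0", "\u025b"]


def describe(ch):
    """Return (code point, UTF-8 byte length, True if code point > 0xFF)."""
    cp = ord(ch)
    return cp, len(ch.encode("utf-8")), cp > 0xFF


for ch in chars:
    cp, nbytes, above_ff = describe(ch)
    print(f"U+{cp:04X} {ch!r}: {nbytes} byte(s) in UTF-8, above U+00FF: {above_ff}")
```

Under that assumption, U+00F0 and the epsilon both take two bytes in UTF-8, so byte length alone would not tell them apart. Only the epsilon lies above U+00FF, which would be consistent with a cutoff at code point 256 rather than at the multibyte boundary.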

In case anyone else hits this bug, replacing \y with
(^|$|\s|[[:punct:]]) seems to work for me, although it's ugly.
