From: | Sergey Burladyan <eshkinkot(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | UTF8 regexp and char classes still does not work |
Date: | 2010-09-28 21:35:00 |
Message-ID: | 877hi5a6wr.fsf@home.progtech.ru |
Lists: | pgsql-hackers |
I see this in the 9.0 release notes:
- Support locale-specific regular expression processing with UTF-8
server encoding (Tom Lane)
Locale-specific regular expression functionality includes
case-insensitive matching and locale-specific character classes.
But character classes still do not work. Example (git REL9_0_STABLE c767c3bd):
select version();
version
------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-8) 4.4.5 20100728 (prerelease), 64-bit
--- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | f        | t
All three must be true, as they are in a KOI8-R database:
create database koi8 template template0 encoding 'koi8r' lc_collate 'ru_RU.KOI8-R' lc_ctype 'ru_RU.KOI8-R';
\c koi8
set client_encoding TO utf8;
select E'\326' ~* E'\366', E'\326' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | t        | t
As far as I can see, Tom's patch 0d323425 changed only functions such as pg_wc_isalpha, but
pg_wc_isalpha is called from the
static struct cvec *
cclass(struct vars * v,        /* context */
       const chr *startp,      /* where the name starts */
       const chr *endp,        /* just past the end of the name */
       int cases)              /* case-independent? */
function. That function carries the comment "For the moment, assume that only char codes < 256 can be in these classes", and it calls pg_wc_isalpha like this:
for (i = 0; i <= UCHAR_MAX; i++)
{
    if (pg_wc_isalpha((chr) i))
        addchr(cv, (chr) i);
}
UCHAR_MAX is 255, so pg_wc_isalpha is never called for any character code above 255.
I do not fully understand the regular expression algorithm, but I think the cclass function also needs to be fixed.
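To make the range problem concrete, here is a minimal sketch, not code from the PostgreSQL tree. It assumes that in a UTF-8 database the regex engine's chr value for a character is its Unicode code point. The helper names decode_utf8_2byte and reached_by_cclass_loop are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decode a two-byte UTF-8 sequence into its
 * Unicode code point.  Only valid two-byte sequences are accepted.
 */
static unsigned int
decode_utf8_2byte(unsigned char b0, unsigned char b1)
{
    assert((b0 & 0xE0) == 0xC0);    /* lead byte must look like 110xxxxx */
    assert((b1 & 0xC0) == 0x80);    /* continuation must look like 10xxxxxx */
    return ((unsigned int) (b0 & 0x1F) << 6) | (unsigned int) (b1 & 0x3F);
}

/*
 * Hypothetical helper: would a loop of the form
 *     for (i = 0; i <= UCHAR_MAX; i++)
 * ever test this character code?
 */
static bool
reached_by_cclass_loop(unsigned int code)
{
    return code <= UCHAR_MAX;
}
```

With these helpers, decode_utf8_2byte(0320, 0266) gives 0x0436 (CYRILLIC SMALL LETTER ZHE), which reached_by_cclass_loop rejects, while the KOI8-R byte 0326 for the same letter is within the loop's range. That matches the f and t results in the two queries above.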
--
Sergey Burladyan