From: | NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp> |
---|---|
To: | pgsql-hackers(at)postgreSQL(dot)org |
Cc: | NISHIYAMA Tomoaki <tomoakin(at)staff(dot)kanazawa-u(dot)ac(dot)jp> |
Subject: | Re: Notes about fixing regexes and UTF-8 (yet again) |
Date: | 2012-02-18 09:29:57 |
Message-ID: | E4F0A52A-AA30-40CB-86A4-D795AB33DC64@staff.kanazawa-u.ac.jp |
Lists: | pgsql-hackers |
I don't believe it is valid to ignore CJK characters above U+20000.
If they are used in names, they will be stored in the database.
If their behaviour differs from that of characters below U+FFFF, you will
get a bug report before long.
See CJK Extension B, C, and D at
http://www.unicode.org/charts/
Also, there are some code points that could be regarded as alphabetic or numeric:
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
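
As a quick illustration, here is a minimal Python sketch (not PostgreSQL code) that asks Python's bundled Unicode database, rather than any locale, how it classifies a few code points above U+FFFF. The three sample characters and the helper name `describe` are chosen only for this example.

```python
import unicodedata

# Sample code points above U+FFFF (picked for illustration only).
samples = {
    "CJK Extension B ideograph": "\U00020000",
    "MATHEMATICAL BOLD CAPITAL A": "\U0001D400",
    "MATHEMATICAL BOLD DIGIT ZERO": "\U0001D7CE",
}

def describe(ch):
    """Return (general category, is alphabetic, is decimal digit) for one character."""
    return unicodedata.category(ch), ch.isalpha(), ch.isdecimal()

for label, ch in samples.items():
    print(f"U+{ord(ch):05X} {label}: {describe(ch)}")
```

All three carry letter or digit properties in the Unicode data, so a regex engine that assigns no class to anything above U+FFFF would treat them differently from their BMP counterparts.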
On the other hand, it is OK if processing of characters above U+10000 is very slow,
as long as they are processed correctly, because they are considered rare.
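
One way to picture that trade-off is a two-tier lookup: a precomputed table for the BMP and a per-call query for everything else. The sketch below is only an illustration under stated assumptions; `BMP_LIMIT`, `_bmp_is_alpha`, and `is_alpha` are hypothetical names, and Python's `str.isalpha()` merely stands in for whatever (possibly expensive) classification call a real regex engine would make.

```python
BMP_LIMIT = 0x10000  # first code point outside the Basic Multilingual Plane

# Fast path: alphabetic flags for every BMP code point, computed once up front.
_bmp_is_alpha = bytearray(1 if chr(cp).isalpha() else 0 for cp in range(BMP_LIMIT))

def is_alpha(cp: int) -> bool:
    """Classify a code point, using the table for the BMP and a direct query above it."""
    if cp < BMP_LIMIT:
        return bool(_bmp_is_alpha[cp])
    # Slow path: query on every call; acceptable if such characters are rare.
    return chr(cp).isalpha()

print(is_alpha(0x41), is_alpha(0x20000), is_alpha(0x1F600))
```

The point of the split is that correctness is kept for the supplementary planes while the common case stays a single table lookup.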
On 2012/02/17, at 23:56, Andrew Dunstan wrote:
>
>
> On 02/17/2012 09:39 AM, Tom Lane wrote:
>> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> Here's a wild idea: keep the class of each codepoint in a hash table.
>>> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
>>> string contains a character that's not in the hash table yet, query the
>>> class of that character, and add it to the hash table. Then recompile
>>> the whole regex and restart the matching engine.
>>> Recompiling is expensive, but if you cache the results for the session,
>>> it would probably be acceptable.
>> Dunno ... recompiling is so expensive that I can't see this being a win;
>> not to mention that it would require fundamental surgery on the regex
>> code.
>>
>> In the Tcl implementation, no codepoints above U+FFFF have any locale
>> properties (alpha/digit/punct/etc), period. Personally I'd not have a
>> problem imposing the same limitation, so that dealing with stuff above
>> that range isn't really a consideration anyway.
>
>
> up to U+FFFF is the BMP which is described as containing "characters for almost all modern languages, and a large number of special characters." It seems very likely to be acceptable not to bother about the locale of code points in the supplementary planes.
>
> See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved.
>
>
> cheers
>
> andrew