From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | John Hansen <john(at)geeknet(dot)com(dot)au> |
Cc: | Palle Girgensohn <girgen(at)pingpong(dot)net>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Patch for collation using ICU |
Date: | 2005-05-07 14:34:24 |
Message-ID: | 200505071434.j47EYOg05130@candle.pha.pa.us |
Lists: | pgsql-hackers |
John Hansen wrote:
> Bruce Momjian wrote:
> >
> > There are two reasons for that optimization --- first, some
> > locale support is broken and Unicode encoding with a C locale
> > crashes (not an issue for ICU), and second, it is an
> > optimization for languages like Japanese that want to use
> > unicode, but don't need a locale because upper/lower means
> > nothing in those character sets.
>
> No, upper/lower means nothing in those languages, so why would you need
> to optimize upper/lower if they're not used??
True. I suppose it is for databases that use both Japanese and Latin
alphabets and run upper() on all values.
> And if they are, it's obviously because the text contains characters
> from other languages (probably English) and as such they should behave
> correctly.
>
> Did I mention that for Japanese and the like, ICU would also offer
> transliteration...
Interesting.
> > So, the first issue doesn't apply for ICU, and the second
> > might not depending on what characters you are using in the
> > Unicode character set.
> >
> > I guess I am a little confused how ICU can do upper() when the
> > locale is C. What is it using to determine A is upper for a?
> > Am I confused?
>
> Simple, UNICODE basically consists of a table of characters
> (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)
>
> Excerpt:
>
> 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
> ...
> 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
>
> From this you can see that for 0041, which is capital letter A, there
> is a mapping to its lowercase counterpart, 0061.
> Likewise, there is a mapping for 0061 which says its uppercase
> counterpart is 0041.
> There is also SpecialCasing.txt, which covers those mappings that
> haven't got a 1-1 mapping, such as the German sharp s (uppercased to SS).
>
> These mappings are fixed and independent of locale; only a few cases
> from SpecialCasing.txt depend on locale/context.
As far as I know, the only way to use Unicode currently is to use a
locale that is Unicode-aware.
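
The table lookup John describes can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not how
ICU or PostgreSQL implement it: the function names (parse_case_mappings,
simple_upper, simple_lower) are hypothetical, and the data is limited to
the two UnicodeData.txt records quoted above. It relies on the
UnicodeData.txt layout of 15 semicolon-separated fields, where field 12
is the simple uppercase mapping and field 13 the simple lowercase
mapping (counting from 0).

```python
# The two UnicodeData.txt records quoted in the message above.
RECORDS = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def parse_case_mappings(text):
    """Build (to_upper, to_lower) dicts of code point -> code point."""
    to_upper, to_lower = {}, {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) != 15:
            continue  # skip anything that is not a full record
        code = int(fields[0], 16)
        if fields[12]:  # simple uppercase mapping
            to_upper[code] = int(fields[12], 16)
        if fields[13]:  # simple lowercase mapping
            to_lower[code] = int(fields[13], 16)
    return to_upper, to_lower


def simple_upper(s, to_upper):
    """Map each character through the table; unmapped ones pass through."""
    return "".join(chr(to_upper.get(ord(c), ord(c))) for c in s)


def simple_lower(s, to_lower):
    """Same as simple_upper, using the lowercase table."""
    return "".join(chr(to_lower.get(ord(c), ord(c))) for c in s)


TO_UPPER, TO_LOWER = parse_case_mappings(RECORDS)

# Hypothetical usage: no locale is consulted anywhere in the lookup.
UPPERED = simple_upper("a", TO_UPPER)
LOWERED = simple_lower("A", TO_LOWER)

# Python's own str.upper() also applies the one-to-many mappings from
# SpecialCasing.txt, which a simple 1-1 table like the one above cannot.
SHARP_S_UPPER = "\u00df".upper()
```

Nothing in the lookup depends on the process locale, which is the point
John is making. The locale/context-dependent entries in
SpecialCasing.txt are deliberately left out of this sketch.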
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073