From: | Palle Girgensohn <girgen(at)pingpong(dot)net> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Patch for collation using ICU |
Date: | 2005-05-07 14:10:30 |
Message-ID: | 0B537F6953FA3B724B5761A7@palle.girgensohn.se |
Lists: | pgsql-hackers |
--On Saturday, May 07, 2005 09:52:59 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> Palle Girgensohn wrote:
>> >> Also, apparently, ICU is installed by default in many linux
>> >> distributions, and usually it is version 2.8. Some linux users have
>> >> asked me if there are plans for a patch that works with ICU 2.8.
>> >> That's probably a good idea. IBM and the ICU folks seem to consider
>> >> 3.2 to be the stable version, older versions are hard to find on
>> >> their sites, but most linux distributors seem to consider it too
>> >> bleeding edge, even gentoo. I don't know why they don't agree.
>> >
>> > Good point. Why would linux folks need ICU? Doesn't their OS support
>> > encodings natively? I am particularly excited about this for OSs that
>> > don't have such encodings, like UTF8 support for Win32.
>> >
>> > Because ICU will not be used unless enabled by configure, it seems we
>> > are fine with only supporting the newest version. Do Linux users need
>> > to use ICU for any reason?
>>
>>
>> There are corner cases where it is impossible to upper/lowercase one
>> character at a time. For example:
>>
>> -- without ICU
>> select upper('Eßer');
>> upper
>> -------
>> EßER
>> (1 row)
>>
>> -- with ICU
>> select upper('Eßer');
>> upper
>> -------
>> ESSER
>> (1 row)
>>
>> This is because in the standard postgres implementation, upper/lower is
>> done one character at a time. A proper upper/lower cannot do it that
>> way. Another known example is in Turkish, where an ? (?) should look
>> different whether it is an initial letter or not. This fails in
>> standard postgresql for all platforms.
>
> Uh, where do you see that? Our code has:
>
> workspace = texttowcs(string);
>
> for (i = 0; workspace[i] != 0; i++)
> workspace[i] = towupper(workspace[i]);
As you see, the loop runs towupper on one character at a time. It cannot
consider whether the letter is the initial, as required in Turkish, and it
cannot really convert one character into two ('ß' -> 'SS').
>
> result = wcstotext(workspace, i);
>
>
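To illustrate the point about the loop above, here is a minimal sketch of why the result buffer has to be able to grow. It is not code from the patch or from PostgreSQL: upper_with_expansion is a hypothetical helper, it handles only the single special case U+00DF (ß) -> "SS", and it uses plain malloc rather than palloc. ICU does the real work with full locale-aware rules; this only shows the shape of the problem.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, for illustration only.
 * Upper-cases a wide string and lets the result be longer than the input.
 * Only U+00DF (sharp s) is expanded, to "SS"; everything else still goes
 * through towupper one character at a time.
 * Returns a malloc'd string that the caller must free, or NULL on failure.
 */
wchar_t *
upper_with_expansion(const wchar_t *src)
{
    size_t   len = wcslen(src);
    size_t   i;
    size_t   j = 0;
    /* worst case in this sketch: every character becomes two */
    wchar_t *dst = malloc((2 * len + 1) * sizeof(wchar_t));

    if (dst == NULL)
        return NULL;

    for (i = 0; i < len; i++)
    {
        if (src[i] == 0x00DF)
        {
            /* one input character, two output characters */
            dst[j++] = L'S';
            dst[j++] = L'S';
        }
        else
            dst[j++] = (wchar_t) towupper((wint_t) src[i]);
    }
    dst[j] = L'\0';
    return dst;
}
```

The in-place loop quoted above cannot do this, since it writes each result back into workspace[i], so the output can never be longer than the input.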
>> >> Also, in the latest patch, I also added checks and logging for *every*
>> >> status returned from ICU. I hope this will help debugging on debian,
>> >> where the previous version didn't work. That excessive status checking
>> >> will hardly be necessary once the stuff is better tested.
>> >>
>> >> I think the string copying and heap/palloc choices account for most of
>> >> the code bloat, together with the excessive status checking and
>> >> logging.
>> >
>> > OK, move that into some common functions and I think it will be better.
>>
>> Best way for upper/lower/initcap is probably to use a function
>> pointer... uhh...
>
> Uh, I don't think so. Just send pointers to the function and let
> the function allocate the memory, and another function to free them, or
> something like that. I can probably do it if you want.
I'll check it out, it seems simple enough.
>> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does
>> > that help?
>>
>> I'm aware of that. It might help for unicode, but there are a bunch of
>> other encodings. IANA has decided that utf-8 has *no* aliases, hence
>> only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
>> forgiving, I don't remember/know, but I think we need the mappings,
>> unfortunately.
>
> OK. I guess I am just confused why the native implementations are OK.
They're OK since they understand that UNICODE (or UTF8) is really utf-8.
Problem is the strings used to describe them are not understood by ICU.
BTW, the pg_enc2iananame_tbl is only used *from* internal representation
*to* IANA, not the other way around. Maybe that fact lowers the rate of
confusion? ;-)
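A rough sketch of what such a one-way mapping looks like, in case it helps. The table contents, the struct, and the function name example_enc2iananame are made up for illustration and are not the actual pg_enc2iananame_tbl from the patch:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry type: internal encoding name -> IANA charset name. */
typedef struct
{
    const char *pg_name;    /* name used internally */
    const char *iana_name;  /* name as registered with IANA */
} example_enc_map;

/* Illustrative entries only; the real table covers many more encodings. */
static const example_enc_map example_enc2iana_tbl[] = {
    {"UNICODE", "UTF-8"},
    {"UTF8", "UTF-8"},
    {"LATIN1", "ISO-8859-1"},
    {"LATIN2", "ISO-8859-2"},
    {"EUC_JP", "EUC-JP"},
};

/*
 * Look up the IANA name for an internal encoding name.
 * Returns NULL if the name is not in the table. The lookup only goes
 * from the internal name to the IANA name, never the other way.
 */
const char *
example_enc2iananame(const char *pg_name)
{
    size_t n = sizeof(example_enc2iana_tbl) / sizeof(example_enc2iana_tbl[0]);
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (strcmp(example_enc2iana_tbl[i].pg_name, pg_name) == 0)
            return example_enc2iana_tbl[i].iana_name;
    }
    return NULL;
}
```

Both UNICODE and UTF8 resolve to the same IANA name here, and an IANA name passed in as the key is simply not found.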
/Palle