Re: Patch for collation using ICU

From: Palle Girgensohn <girgen(at)pingpong(dot)net>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-07 14:10:30
Message-ID: 0B537F6953FA3B724B5761A7@palle.girgensohn.se
Lists: pgsql-hackers

--On Saturday, May 07, 2005 09.52.59 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> Palle Girgensohn wrote:
>> >> Also, apparently, ICU is installed by default in many linux
>> >> distributions, and usually it is version 2.8. Some linux users have
>> >> asked me if there are plans for a patch that works with ICU 2.8.
>> >> That's probably a good idea. IBM and the ICU folks seem to consider
>> >> 3.2 to be the stable version, older versions are hard to find on
>> >> their sites, but most linux distributors seem to consider it too
>> >> bleeding edge, even gentoo. I don't know why they don't agree.
>> >
>> > Good point. Why would linux folks need ICU? Doesn't their OS support
>> > encodings natively? I am particularly excited about this for OSs that
>> > don't have such encodings, like UTF8 support for Win32.
>> >
>> > Because ICU will not be used unless enabled by configure, it seems we
>> > are fine with only supporting the newest version. Do Linux users need
>> > to use ICU for any reason?
>>
>>
>> There are corner cases where it is impossible to upper/lowercase one
>> character at a time. For example:
>>
>> -- without ICU
>> select upper('Eßer');
>> upper
>> -------
>> EßER
>> (1 row)
>>
>> -- with ICU
>> select upper('Eßer');
>> upper
>> -------
>> ESSER
>> (1 row)
>>
>> This is because in the standard postgres implementation, upper/lower is
>> done one character at a time. A proper upper/lower cannot be done that
>> way. Another known example is Turkish, where an ? (?) should look
>> different depending on whether it is an initial letter or not. This
>> fails in standard postgresql on all platforms.
>
> Uh, where do you see that? Our code has:
>
> workspace = texttowcs(string);
>
> for (i = 0; workspace[i] != 0; i++)
> workspace[i] = towupper(workspace[i]);

As you see, the loop runs towupper on one character at a time. It cannot
consider whether the letter is the initial one, as required in Turkish, and
it cannot really convert one character into two ('ß' -> 'SS').
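
To make the length problem concrete, here is a minimal sketch in plain C. It
is not the PostgreSQL or the ICU code: upper_per_char mirrors the quoted
loop, and upper_with_expansion is a hypothetical helper that special-cases
only U+00DF ('ß'), where a real implementation would ask ICU for the full
case mapping.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Per-character uppercasing, as in the quoted loop. Each input character
 * is overwritten by exactly one output character, so the string can never
 * grow. Returns the (unchanged) length.
 */
static size_t
upper_per_char(wchar_t *s)
{
    size_t i;

    for (i = 0; s[i] != 0; i++)
        s[i] = (wchar_t) towupper((wint_t) s[i]);
    return i;
}

/*
 * Hypothetical string-level version: writes into a separate buffer so that
 * one input character may become two. Only U+00DF -> "SS" is handled here,
 * purely for illustration. dst must have room for 2 * wcslen(src) + 1.
 * Returns the output length.
 */
static size_t
upper_with_expansion(const wchar_t *src, wchar_t *dst)
{
    size_t i;
    size_t n = 0;

    for (i = 0; src[i] != 0; i++)
    {
        if (src[i] == 0x00DF)
        {
            dst[n++] = L'S';
            dst[n++] = L'S';
        }
        else
            dst[n++] = (wchar_t) towupper((wint_t) src[i]);
    }
    dst[n] = 0;
    return n;
}
```

With the four-character input 'Eßer', the first function always returns 4,
while the second returns 5 and produces 'ESSER'. The point is only that the
output buffer has to be sized and filled per string, not per character.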

>
> result = wcstotext(workspace, i);
>
>
>> >> Also, in the latest patch, I added checks and logging for *every*
>> >> status returned from ICU. I hope this will help debugging on debian,
>> >> where the previous version didn't work. That excessive status checking
>> >> will hardly be necessary once the stuff is better tested.
>> >>
>> >> I think the string copying and heap/palloc choices account for most of
>> >> the code bloat, together with the excessive status checking and
>> >> logging.
>> >
>> > OK, move that into some common functions and I think it will be better.
>>
>> Best way for upper/lower/initcap is probably to use a function
>> pointer... uhh...
>
> Uh, I don't think so. Just send pointers to the function and let
> the function allocate the memory, and another function to free them, or
> something like that. I can probably do it if you want.

I'll check it out, it seems simple enough.

>> > We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does
>> > that help?
>>
>> I'm aware of that. It might help for unicode, but there are a bunch of
>> other encodings. IANA has decided that utf-8 has *no* aliases, hence
>> only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
>> forgiving, I don't remember/know, but I think we need the mappings,
>> unfortunately.
>
> OK. I guess I am just confused why the native implementations are OK.

They're OK since they understand that UNICODE (or UTF8) is really utf-8.
The problem is that the strings used to describe them are not understood by
ICU.

BTW, the pg_enc2iananame_tbl is only used *from* internal representation
*to* IANA, not the other way around. Maybe that fact lowers the rate of
confusion? ;-)
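
A rough sketch of what such a one-way mapping could look like, in plain C.
This is not the actual pg_enc2iananame_tbl from the patch: the struct, the
table contents and the function name below are assumptions made up for
illustration, and only the direction (internal name to IANA name) is taken
from the description above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry: internal encoding name -> IANA charset name. */
struct enc_iana_map
{
    const char *pg_name;
    const char *iana_name;
};

/* Illustrative entries only; the real table covers many more encodings. */
static const struct enc_iana_map enc_to_iana[] = {
    {"UTF8", "UTF-8"},
    {"UNICODE", "UTF-8"},
    {"LATIN1", "ISO-8859-1"},
    {"EUC_JP", "EUC-JP"},
};

/*
 * Look up the IANA name for an internal encoding name.
 * Returns NULL if the name is not in the table.
 */
static const char *
pg_enc_to_iana(const char *pg_name)
{
    size_t n = sizeof(enc_to_iana) / sizeof(enc_to_iana[0]);
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (strcmp(enc_to_iana[i].pg_name, pg_name) == 0)
            return enc_to_iana[i].iana_name;
    }
    return NULL;
}
```

Since the lookup only goes from internal name to IANA name, both UNICODE
and UTF8 can point at the same IANA string without any ambiguity.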

/Palle
