From: | Palle Girgensohn <girgen(at)pingpong(dot)net> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Patch for collation using ICU |
Date: | 2005-05-07 13:17:36 |
Message-ID: | 9CC5F10245B7612103CC03F6@palle.girgensohn.se |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> Palle Girgensohn wrote:
>> >
>> > Is this patch ready for application?
>>
>> I don't think so, not quite. I have not had any positive reports from
>> linux users, this is only tested in a FreeBSD environment. I'd say it
>> needs some more testing.
>
> OK.
John Hansen just reported that it does work on linux. fine!
>> Also, apparently, ICU is installed by default in many linux
>> distributions, and usually it is version 2.8. Some linux users have
>> asked me if there are plans for a patch that works with ICU 2.8. That's
>> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
>> the stable version, older versions are hard to find on their sites, but
>> most linux distributers seem to consider it too bleeding edge, even
>> gentoo. I don't know why they don't agree.
>
> Good point. Why would linux folks need ICU? Doesn't their OS support
> encodings natively? I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version. Do Linux users need
> to use ICU for any reason?
There are corner cases where it is impossible to upper/lowercase one
character at the time. for example:
-- without ICU
select upper('Eßer');
upper
-------
EßER
(1 row)
-- with ICU
select upper('Eßer');
upper
-------
ESSER
(1 rad)
This is because in the standard postgres implementation, upper/lower is
done one character at the time. A proper upper/lower cannot do it that way.
Other known example is in Turkish, where an Ì (?) should look different
whether it is an initial letter or not. This fails in standard postgresql
for all platforms.
>> > I do have a few questions:
>> >
>> > Why don't you use the lc_ctype_is_c() part of this test?
>> >
>> > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>>
>> Um, well, I didn't think about that. :) What would be the locale in
>> this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a
>> wide encoding, indeed. Then the strings will be handled like byte-wide
>> chars. Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optmization, and it fixes a problem
> with some OSs that have processing problems with UTF8 when the locale is
> supposed to be turned off, like in "C". I realize ICU might be fine
> with it but the optimization still is an issue.
Well, the results are quite different, depending on whether ICU is used or
not. See separate mail.
>> > Why is so much code added, for example, in lower()? The existing
>> > multibyte code is much smaller, and lots of code is added in other
>> > places too.
>>
>> ICU uses UTF-16 internally, so all strings must be converted from the
>> database encoding to UTF-16. Since that means the strings need to be
>> copied, I took the same approach as in varlena.c:varstr_cmp(), where
>> small strings use the heap and only larger strings use a palloc.
>> Comments in varstr_cmp about performance made me use that approach.
>
> Oh, interesting. I think you need to create new functions that
> factor out that common code so the patch is smaller and easier to
> maintain.
Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me...
>> Also, in the latest patch, I also added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on debian,
>> where previous version didn't work. That excessive status checking is
>> hardly be necessary once the stuff is better tested.
>>
>> I think the string copying and heap/palloc choices stands for most of
>> the code bloat, together with the excessive status checking and logging.
>
> OK, move that into some common functions and I think it will be better.
Best way for upper/lower/initcap is probably to use a function pointer...
uhh...
>> > Why do you need to add a mapping of encoding names from iana to our
>> > names?
>>
>> This was already answered by John Hansen... There's an old thread here
>> about the choice of the name "UNICODE" to describe an encoding, which it
>> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
>> by postgresql, that would have been a better name... Similarly for most
>> other encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
>> PostgreSQL can't provide it, so a mapping table is required.
>
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that
> help?
I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence only
utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
fogiving, I don't remember/know, but I think we need the mappings,
unfortunately.
>> I use this patch in production on one FreeBSD 4.10 server at the moment.
>> With the latest version, I've had no problems. Logging is swithed on for
>> now, and it shows no signs of ICU complaining. I'd like more reports on
>> Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should have
> feature freeze on July 1.
That shouldn't be a problem.
/Palle
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania
> 19073
From | Date | Subject | |
---|---|---|---|
Next Message | John Hansen | 2005-05-07 13:25:15 | Re: Patch for collation using ICU |
Previous Message | John Hansen | 2005-05-07 13:15:29 | Re: Patch for collation using ICU |