Quick Links

Re: Patch for collation using ICU

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 13:17:36
Message-ID:	9CC5F10245B7612103CC03F6@palle.girgensohn.se
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

--On lördag, maj 07, 2005 08.37.05 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:

> Palle Girgensohn wrote:
>> >
>> > Is this patch ready for application?
>>
>> I don't think so, not quite. I have not had any positive reports from
>> linux users, this is only tested in a FreeBSD environment. I'd say it
>> needs some more testing.
>
> OK.

John Hansen just reported that it does work on linux. fine!

>> Also, apparently, ICU is installed by default in many linux
>> distributions, and usually it is version 2.8. Some linux users have
>> asked me if there are plans for a patch that works with ICU 2.8. That's
>> probably a good idea. IBM and the ICU folks seem to consider 3.2 to be
>> the stable version, older versions are hard to find on their sites, but
>> most linux distributers seem to consider it too bleeding edge, even
>> gentoo. I don't know why they don't agree.
>
> Good point. Why would linux folks need ICU? Doesn't their OS support
> encodings natively? I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version. Do Linux users need
> to use ICU for any reason?

There are corner cases where it is impossible to upper/lowercase one
character at the time. for example:

-- without ICU
select upper('Eßer');
upper
-------
EßER
(1 row)

-- with ICU
select upper('Eßer');
upper
-------
ESSER
(1 rad)

This is because in the standard postgres implementation, upper/lower is
done one character at the time. A proper upper/lower cannot do it that way.
Other known example is in Turkish, where an Ì (?) should look different
whether it is an initial letter or not. This fails in standard postgresql
for all platforms.

>> > I do have a few questions:
>> >
>> > Why don't you use the lc_ctype_is_c() part of this test?
>> >
>> > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>>
>> Um, well, I didn't think about that. :) What would be the locale in
>> this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use a
>> wide encoding, indeed. Then the strings will be handled like byte-wide
>> chars. Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optmization, and it fixes a problem
> with some OSs that have processing problems with UTF8 when the locale is
> supposed to be turned off, like in "C". I realize ICU might be fine
> with it but the optimization still is an issue.

Well, the results are quite different, depending on whether ICU is used or
not. See separate mail.

>> > Why is so much code added, for example, in lower()? The existing
>> > multibyte code is much smaller, and lots of code is added in other
>> > places too.
>>
>> ICU uses UTF-16 internally, so all strings must be converted from the
>> database encoding to UTF-16. Since that means the strings need to be
>> copied, I took the same approach as in varlena.c:varstr_cmp(), where
>> small strings use the heap and only larger strings use a palloc.
>> Comments in varstr_cmp about performance made me use that approach.
>
> Oh, interesting. I think you need to create new functions that
> factor out that common code so the patch is smaller and easier to
> maintain.

Hmm, yes, perhaps it can be refactored a bit. It has ocurred to me...

>> Also, in the latest patch, I also added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on debian,
>> where previous version didn't work. That excessive status checking is
>> hardly be necessary once the stuff is better tested.
>>
>> I think the string copying and heap/palloc choices stands for most of
>> the code bloat, together with the excessive status checking and logging.
>
> OK, move that into some common functions and I think it will be better.

Best way for upper/lower/initcap is probably to use a function pointer...
uhh...

>> > Why do you need to add a mapping of encoding names from iana to our
>> > names?
>>
>> This was already answered by John Hansen... There's an old thread here
>> about the choice of the name "UNICODE" to describe an encoding, which it
>> doesn't. There's half a dozen unicode based encodings... UTF-8 is used
>> by postgresql, that would have been a better name... Similarly for most
>> other encodings, really. ICU expect a setlocale(3) string (i.e. IANA).
>> PostgreSQL can't provide it, so a mapping table is required.
>
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash). Does that
> help?

I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence only
utf-8 (with dash, but case insensitve) is accepted. Perhaps ICU is
fogiving, I don't remember/know, but I think we need the mappings,
unfortunately.

>> I use this patch in production on one FreeBSD 4.10 server at the moment.
>> With the latest version, I've had no problems. Logging is swithed on for
>> now, and it shows no signs of ICU complaining. I'd like more reports on
>> Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should have
> feature freeze on July 1.

That shouldn't be a problem.

/Palle

>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania
> 19073

In response to

Re: Patch for collation using ICU at 2005-05-07 12:37:05 from Bruce Momjian

Responses

Re: Patch for collation using ICU at 2005-05-07 13:52:59 from Bruce Momjian

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	John Hansen	2005-05-07 13:25:15	Re: Patch for collation using ICU
Previous Message	John Hansen	2005-05-07 13:15:29	Re: Patch for collation using ICU