From: | Palle Girgensohn <girgen(at)pingpong(dot)net> |
---|---|
To: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | John Hansen <john(at)geeknet(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Patch for collation using ICU |
Date: | 2005-05-07 14:21:47 |
Message-ID: | 179E997449CCFBB86D8AEC40@palle.girgensohn.se |
Lists: | pgsql-hackers |
--On Saturday, May 07, 2005 10.06.43 -0400 Bruce Momjian
<pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> Palle Girgensohn wrote:
>>
>> --On Saturday, May 07, 2005 23.15.29 +1000 John Hansen
>> <john(at)geeknet(dot)com(dot)au> wrote:
>>
>> > Btw, I had been planning to propose replacing every single one of the
>> > built in charset conversion functions with calls to ICU (thus making pg
>> > _depend_ on ICU), as this would seem like a cleaner solution than for
>> > us to maintain our own conversion tables.
>> >
>> > ICU also has a fair few conversions that we do not have at present.
>
> That is a much larger issue, similar to our shipping our own timezone
> database. What does it buy us?
>
> o Do we ship it in our tarball?
> o Is the license compatible?
It looks pretty similar to BSD, although I'm a novice on the subject.
> o Does it remove utils/mb conversions?
Yes, it would probably be possible to remove pg's own conversions.
> o Does it allow us to index LIKE (next high char)?
I believe so, using ICU's substring stuff.
> o Does it allow us to support multiple encodings in
> a single database easier?
Heh, the ultimate dream. Perhaps?
> o performance?
ICU in itself is said to be much faster than, for example, glibc. The problem
is the need for conversion via UTF-16, which requires extra memory allocations
and CPU cycles. I don't use glibc, but my very simple performance tests on
FreeBSD show that it is similar in speed.
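
To make the UTF-16 cost concrete, here is a rough Python sketch (not ICU's API; the helper name to_utf16_buffer is made up for illustration). It only shows that single-byte LATIN1 data has to be copied into a new buffer, twice the size, before a UTF-16 based library can work on it:

```python
def to_utf16_buffer(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Hypothetical helper: re-encode database bytes as UTF-16 (little endian).

    This stands in for the conversion step needed before calling a
    UTF-16 based library; it always allocates a new buffer.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-16-le")


latin1_bytes = "Malmö".encode("latin-1")      # 5 characters, 5 bytes
utf16_bytes = to_utf16_buffer(latin1_bytes)   # same 5 characters, 10 bytes

print(len(latin1_bytes), len(utf16_bytes))
```

For single-byte encodings the converted copy is always twice as large, and the copy has to be made for each string handed to the library.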
>
>> I just had a similar thought. And why use ICU only for multibyte
>> charsets? If I use LATIN1, I still expect upper('ß') => SS, and I don't
>> get it... Same for the Turkish example.
>
> We assume the native toupper() can handle single-byte character
> encodings. We use towupper() only for wide character sets.
True, the problem is that native toupper/towupper work on one character at a
time. This is a bad design decision in POSIX: there is no way they can handle
the examples above without considering more than one character. ICU does just
that.
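
A small Python sketch of the difference, purely for illustration (per_char_upper and whole_string_upper are made-up names, and Python's str.upper() stands in for a string-level case mapping such as ICU's; this is not PostgreSQL or ICU code). A function that must return exactly one character per input character has nowhere to put the second S:

```python
def per_char_upper(text: str) -> str:
    """Mimic a toupper()-style contract: one character in, one character out."""
    out = []
    for ch in text:
        up = ch.upper()
        # A one-to-one API cannot return a longer result,
        # so the character is left unchanged in that case.
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def whole_string_upper(text: str) -> str:
    """String-level case mapping: the result may be longer than the input."""
    return text.upper()


word = "straße"
print(per_char_upper(word))       # the ß survives untouched
print(whole_string_upper(word))   # the ß expands to SS
```

The Turkish dotted/dotless i case additionally needs the locale, which this sketch does not cover.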
/Palle