From: | "John Hansen" <john(at)geeknet(dot)com(dot)au> |
---|---|
To: | "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Palle Girgensohn" <girgen(at)pingpong(dot)net> |
Cc: | <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Patch for collation using ICU |
Date: | 2005-05-07 13:07:07 |
Message-ID: | 5066E5A966339E42AA04BA10BA706AE50A92FD@rodrick.geeknet.com.au |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Bruce Momjian wrote:
> Palle Girgensohn wrote:
> > >
> > > Is this patch ready for application?
> >
> > I don't think so, not quite. I have not had any positive
> reports from
> > linux users, this is only tested in a FreeBSD environment.
> I'd say it
> > needs some more testing.
>
> OK.
>
> > Also, apparently, ICU is installed by default in many linux
> > distributions, and usually it is version 2.8. Some linux users have
> > asked me if there are plans for a patch that works with ICU 2.8.
> > That's probably a good idea. IBM and the ICU folks seem to consider
> > 3.2 to be the stable version, older versions are hard to
> find on their
> > sites, but most linux distributers seem to consider it too bleeding
> > edge, even gentoo. I don't know why they don't agree.
>
> Good point. Why would linux folks need ICU? Doesn't their
> OS support encodings natively? I am particularly excited
> about this for OSs that don't have such encodings, like UTF8
> support for Win32.
>
> Because ICU will not be used unless enabled by configure, it
> seems we are fine with only supporting the newest version.
> Do Linux users need to use ICU for any reason?
Yes, because on many linux platforms locale support is broken.
Also, ICU enables full unicode support, particularly in multi-language
situations where locale is C, and makes upper/lower/initcap work as
expected, except where it depends on locale information.
There are also many other useful things in ICU that could be
implemented. Transliteration, and break-iterators for example.
Break-iteration particularly interresting for converting a text to a
list of words. Another is it's builtin substring searches.
>
> > > I do have a few questions:
> > >
> > > Why don't you use the lc_ctype_is_c() part of this test?
> > >
> > > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
> >
> > Um, well, I didn't think about that. :) What would be the
> locale in
> > this case? c_C.UTF-8? ;) Hmm, it is possible to have
> CTYPE=C and use
> > a wide encoding, indeed. Then the strings will be handled
> like byte-wide chars.
> > Yeah, it's a bug. I'll fix it! Thanks.
>
> The additional test is more of an optmization, and it fixes a
> problem with some OSs that have processing problems with UTF8
> when the locale is supposed to be turned off, like in "C". I
> realize ICU might be fine with it but the optimization still
> is an issue.
That the locale is supposed to be turned off, doesn't mean it shouldn't
use ICU.
ICU is more than just locales.
> > > Why is so much code added, for example, in lower()? The existing
> > > multibyte code is much smaller, and lots of code is added
> in other
> > > places too.
> >
> > ICU uses UTF-16 internally, so all strings must be
> converted from the
> > database encoding to UTF-16. Since that means the strings
> need to be
> > copied, I took the same approach as in
> varlena.c:varstr_cmp(), where
> > small strings use the heap and only larger strings use a palloc.
> > Comments in varstr_cmp about performance made me use that approach.
>
> Oh, interesting. I think you need to create new functions that
> factor out that common code so the patch is smaller and
> easier to maintain.
>
> > Also, in the latest patch, I also added checks and logging
> for *every*
> > status returned from ICU. I hope this will help debugging
> on debian,
> > where previous version didn't work. That excessive status
> checking is
> > hardly be necessary once the stuff is better tested.
> >
> > I think the string copying and heap/palloc choices stands
> for most of
> > the code bloat, together with the excessive status checking
> and logging.
>
> OK, move that into some common functions and I think it will
> be better.
>
> > > Why do you need to add a mapping of encoding names from
> iana to our
> > > names?
> >
> > This was already answered by John Hansen... There's an old
> thread here
> > about the choice of the name "UNICODE" to describe an
> encoding, which
> > it doesn't. There's half a dozen unicode based encodings...
> UTF-8 is
> > used by postgresql, that would have been a better name... Similarly
> > for most other encodings, really. ICU expect a setlocale(3) string
> > (i.e. IANA). PostgreSQL can't provide it, so a mapping
> table is required.
>
> We have depricated UNICODE in 8.1 in favor of UTF8 (no dash).
> Does that help?
>
> > I use this patch in production on one FreeBSD 4.10 server
> at the moment.
> > With the latest version, I've had no problems. Logging is
> swithed on
> > for now, and it shows no signs of ICU complaining. I'd like more
> > reports on Linux, though.
>
> OK, I certainly would like this all done for 8.1 which should
> have feature freeze on July 1.
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square,
> Pennsylvania 19073
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | John Hansen | 2005-05-07 13:08:42 | Re: Patch for collation using ICU |
Previous Message | Palle Girgensohn | 2005-05-07 13:03:22 | Re: Patch for collation using ICU |