Re: Patch for collation using ICU

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	John Hansen <john(at)geeknet(dot)com(dot)au>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch for collation using ICU
Date:	2005-05-07 13:36:57
Message-ID:	1B9F2612297F6479B9E5E7B1@palle.girgensohn.se
Lists:	pgsql-hackers

--On Saturday, May 07, 2005 23.15.29 +1000 John Hansen <john(at)geeknet(dot)com(dot)au>
wrote:

> Btw, I had been planning to propose replacing every single one of the
> built in charset conversion functions with calls to ICU (thus making pg
> _depend_ on ICU), as this would seem like a cleaner solution than for us
> to maintain our own conversion tables.
>
> ICU also has a fair few conversions that we do not have at present.
>
> Any thoughts?

I just had a similar thought. And why use ICU only for multibyte charsets?
If I use LATIN1, I still expect upper('ß') => SS, and I don't get that result...
The same goes for the Turkish example.
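
To make the upper('ß') point concrete, here is a minimal sketch in Python (not PostgreSQL or ICU code). It contrasts a one-to-one, length-preserving case table with full Unicode case mapping. The helper names simple_upper, full_upper and turkish_upper are made up for illustration, and turkish_upper is only a crude stand-in for the locale tailoring a library such as ICU would apply.

```python
def simple_upper(text: str) -> str:
    """Uppercase one character at a time, keeping the length fixed.

    Models a one-to-one conversion table: a character whose uppercase
    form is longer than one character is left unchanged.
    """
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def full_upper(text: str) -> str:
    """Full Unicode case mapping, where one character may expand to several."""
    return text.upper()


def turkish_upper(text: str) -> str:
    """Hypothetical stand-in for Turkish tailoring: dotted i maps to U+0130."""
    return text.replace("i", "\u0130").upper()


if __name__ == "__main__":
    print(simple_upper("straße"))     # ß survives, length unchanged
    print(full_upper("straße"))       # ß expands to SS
    print(full_upper("istanbul"))     # locale-independent mapping
    print(turkish_upper("istanbul"))  # Turkish-style mapping
```

The point is only that the expected result depends on full case mapping and on the locale, independent of whether the database encoding is single-byte or multibyte.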

It does eat more memory, and can perhaps cost some performance? With
the current scheme, a strdup is often enough, or at least just one palloc.
With ICU, using UTF-16, you must allocate memory twice, the extra allocation
being for ICU's internal UTF-16 representation. That's not a very strong
objection, though, as this would be an option... :)

John, I have a hard time finding docs about how ICU 2.8 differs from
3.2. Do you have any pointers?

It seems 3.2 has much better support and more bug fixes, so I'm not sure
we should really consider 2.8.

/Palle

>
> ... John
>
>> -----Original Message-----
>> From: John Hansen
>> Sent: Saturday, May 07, 2005 11:09 PM
>> To: 'Palle Girgensohn'; 'Bruce Momjian'
>> Cc: 'pgsql-hackers(at)postgresql(dot)org'
>> Subject: RE: [HACKERS] Patch for collation using ICU
>>
>> > --On Saturday, May 07, 2005 22.53.46 +1000 John Hansen
>> > <john(at)geeknet(dot)com(dot)au>
>> > wrote:
>> >
>> > > Errm,... initdb --encoding UNICODE --locale C
>> >
>> > You mean that ICU *shall* be used even for the C locale, and not as
>> > Bruce suggested here:
>>
>> Yes, that's exactly what I mean.
>>
>> >
>> > >> I do have a few questions:
>> > >>
>> > >> Why don't you use the lc_ctype_is_c() part of this test?
>> > >>
>> > >> if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
>> > >
>> > > Um, well, I didn't think about that. :) What would be the locale in
>> > > this case? c_C.UTF-8? ;) Hmm, it is possible to have CTYPE=C and use
>> > > a wide encoding, indeed. Then the strings will be handled like
>> > > byte-wide chars.
>> > > Yeah, it's a bug. I'll fix it! Thanks.
>> >
>> > John disagrees here, and I'm obliged to agree. Using the C locale, one
>> > will expect C collation, but upper/lower is better off still using
>> > ICU. Hence, the above stuff is *not* a bug. Do we agree?
>> >
>> > /Palle
>> >
>> >
>> > >
>> > >> -----Original Message-----
>> > >> From: pgsql-hackers-owner(at)postgresql(dot)org
>> > >> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of John Hansen
>> > >> Sent: Saturday, May 07, 2005 10:23 PM
>> > >> To: Palle Girgensohn; Bruce Momjian
>> > >> Cc: pgsql-hackers(at)postgresql(dot)org
>> > >> Subject: Re: [HACKERS] Patch for collation using ICU
>> > >>
>> > >> >
>> > >> > I use this patch in production on one FreeBSD 4.10 server at the
>> > >> > moment.
>> > >> > With the latest version, I've had no problems. Logging is switched on
>> > >> > for now, and it shows no signs of ICU complaining. I'd like more
>> > >> > reports on Linux, though.
>> > >>
>> > >> I currently use this on gentoo with ICU3.2 unmasked.
>> > >>
>> > >> Works a dream, even with locale C and UNICODE database.
>> > >>
>> > >> Small test:
>> > >>
>> > >> createdb --encoding UNICODE --locale C test
>> > >> psql test
>> > >> set client_encoding=iso88591;
>> > >> CREATE TABLE test (t text);
>> > >> INSERT INTO test (t) VALUES ('æøå');
>> > >> set client_encoding=unicode;
>> > >> INSERT INTO test (t) SELECT upper(t) FROM test;
>> > >> set client_encoding=iso88591;
>> > >> SELECT * FROM test;
>> > >> t
>> > >> -----
>> > >> æøå
>> > >> ÆØÅ
>> > >> (2 rows)
>> > >>
>> > >> Just as I'd expect, as upper/lower/initcap are locale independent for
>> > >> these characters.

In response to

Re: Patch for collation using ICU at 2005-05-07 13:15:29 from John Hansen

Responses

Re: Patch for collation using ICU at 2005-05-07 14:06:43 from Bruce Momjian
