Re: Patch for collation using ICU

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: john(at)geeknet(dot)com(dot)au
Cc: pgman(at)candle(dot)pha(dot)pa(dot)us, girgen(at)pingpong(dot)net, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch for collation using ICU
Date: 2005-05-09 14:32:00
Message-ID: 20050509.233200.71085686.t-ishii@sra.co.jp
Lists: pgsql-hackers

> > -----Original Message-----
> > From: Tatsuo Ishii [mailto:t-ishii(at)sra(dot)co(dot)jp]
> > Sent: Sunday, May 08, 2005 11:08 PM
> > To: John Hansen
> > Cc: pgman(at)candle(dot)pha(dot)pa(dot)us; girgen(at)pingpong(dot)net;
> > pgsql-hackers(at)postgresql(dot)org
> > Subject: Re: [HACKERS] Patch for collation using ICU
> >
> > > > I don't buy it. If the current conversion tables do the right
> > > > thing, why do we need to replace them? Or if the conversion tables
> > > > are not correct, why don't you fix them? I think the rules of
> > > > character conversion will not change frequently, especially for
> > > > LATIN languages. Thus the maintenance cost is not too high.
> > >
> > > I never said we need to, but if we're going to implement ICU, then
> > > we might as well go all the way.
> >
> > So you admit there's no benefit in using ICU to replace the
> > existing conversions?
> >
> > Besides the fact that ICU does not support all existing conversions,
> > I think ICU has a serious flaw when used for conversion. If I
> > understand correctly, ICU uses UNICODE internally to do the
> > conversion. For example, to implement
> > SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE, then
> > converts UNICODE to EUC_JP. The problem is that these conversions are
> > not round trip (conversion between SJIS/EUC_JP and UNICODE will
> > lose some information). Thus SJIS->EUC_JP->SJIS conversion
> > using ICU does not preserve the original text.
>
> Just for the record, I fetched a web page encoded in sjis, converted
> it to euc-jp and back using uconv from ICU 3.2, and the result is
> identical to the original file.
>
> uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
> diff index.html index.html.sjis

Not all SJIS/EUC_JP characters have the problem. You might want to
try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
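
As a rough illustration of why those three byte sequences are interesting,
here is a minimal Python sketch. It is an assumption-laden stand-in, not an
ICU test: Python's "cp932" codec plays the role of a vendor-extended
Shift_JIS table, and the helper name report() is made up for this example.
ICU's own tables may map these bytes differently.

```python
# Illustration only: Python's "cp932" codec stands in for a vendor-extended
# Shift_JIS table. It is not ICU, and ICU's mappings may differ.
CANDIDATES = [b"\x81\xe6", b"\x87\x9a", b"\xfa\x5b"]


def report(candidates=CANDIDATES, codec="cp932"):
    """Pivot each byte sequence through Unicode and back with one codec.

    Returns one tuple per input:
    (input hex, code point, output hex, whether the bytes survived).
    """
    rows = []
    for raw in candidates:
        text = raw.decode(codec)   # legacy bytes -> Unicode
        back = text.encode(codec)  # Unicode -> legacy bytes
        rows.append((raw.hex(), f"U+{ord(text):04X}", back.hex(), back == raw))
    return rows


if __name__ == "__main__":
    for row in report():
        print(row)
```

If several legacy byte sequences decode to the same Unicode code point, the
encoder can pick only one of them on the way back, so at most one of the
inputs can survive the round trip unchanged.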

BTW, I got this with ICU 3.2:

$ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found

The content of a.txt is 0xa1c1, which is a valid EUC_JP character.

This makes me nervous about using ICU...
--
Tatsuo Ishii
