From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, John Hansen <john(at)geeknet(dot)com(dot)au>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> |
Subject: | Three-byte Unicode characters |
Date: | 2005-04-10 13:51:59 |
Message-ID: | 200504101351.j3ADpxX05679@candle.pha.pa.us |
Lists: | pgsql-hackers |
[ This email to hackers from last night got lost, so I am remailing it. ]
Tom Lane wrote:
> "John Hansen" <john(at)geeknet(dot)com(dot)au> writes:
> >> That is backpatched to 8.0.X. Does that not fix the problem reported?
>
> > No, as andrew said, what this patch does, is allow values > 0xffff and
> > at the same time validates the input to make sure it's valid utf8.
>
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such. So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.
>
> Not that I'm against adding the functionality. I'm just doubtful that
> the reports we've seen really indicate that we need it, or that adding
> it will cut down on the incidence of complaints :-(
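The encoding mix-up Tom describes can be sketched in a few lines. This is only an illustration using Python's own UTF-8 decoder, not the backend's validation code; the helper name `is_valid_utf8` and the sample string are assumptions made for the example. It shows that Latin1 bytes handed over as if they were UTF-8 are usually rejected by a strict validity check.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8.

    Hypothetical helper for illustration; it relies on Python's
    strict decoder, not on any PostgreSQL routine.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded two ways.
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False: 0xE9 starts a multi-byte
                                    # sequence that is never completed
```

A client that sends the second byte string while claiming a Unicode encoding has a configuration problem rather than a need for wider code points, which is the case Tom is worried about masking.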
OK, I got on the IRC server and talked to folks who actually understand
this. They say there are Chinese users reporting this problem, so I
Googled and found this:
http://www.yale.edu/chinesemac/pages/charset_encoding.html#Unicode
See the paragraph on the "Supplementary Ideographic Plane", which says:
The Supplementary Ideographic Plane (SIP) currently contains 42,711
additional characters in "CJK Unified Ideographs Extension B"
(U+20000-2A6D6). The PDF chart for this is available at:
http://www.unicode.org/charts/PDF/U20000.pdf
I assume it is that U+20000-2A6D6 range that people are complaining
about.
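To make the range concrete, here is a small sketch of how those code points look in UTF-8. The constant and function names are assumptions for the example, and it uses Python's encoder rather than anything in the backend. Code points in U+20000-2A6D6 are above 0xffff and take four bytes in UTF-8, whereas common CJK ideographs in the Basic Multilingual Plane take three.

```python
# Range quoted above for "CJK Unified Ideographs Extension B".
EXT_B_FIRST = 0x20000
EXT_B_LAST = 0x2A6D6


def utf8_len(codepoint: int) -> int:
    """Number of bytes Python's UTF-8 encoder emits for one code point."""
    return len(chr(codepoint).encode("utf-8"))


def in_extension_b(codepoint: int) -> bool:
    """True if the code point lies in the quoted U+20000-2A6D6 range."""
    return EXT_B_FIRST <= codepoint <= EXT_B_LAST


# U+4E2D is an ordinary BMP ideograph; U+20000 is the first Extension B one.
print(utf8_len(0x4E2D), in_extension_b(0x4E2D))    # 3 False
print(utf8_len(0x20000), in_extension_b(0x20000))  # 4 True
print(chr(0x20000).encode("utf-8"))                # b'\xf0\xa0\x80\x80'

# Size of the range, which matches the 42,711 characters quoted above.
print(EXT_B_LAST - EXT_B_FIRST + 1)                # 42711
```

Any check that stops at 0xffff, or at three-byte sequences, would reject every character in this range even though the input is valid UTF-8.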
So, we do have a bug, and we are probably going to need to fix it in
8.0.X.
I apologize to the people who reported this problem; I was not attentive
enough to its seriousness.
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073