From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Vick Khera <vivek(at)khera(dot)org> |
Cc: | pgsql-general <pgsql-general(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Subject: | Re: How well does PostgreSQL 9.6.1 support unicode? |
Date: | 2016-12-21 16:31:38 |
Message-ID: | 19113.1482337898@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Vick Khera <vivek(at)khera(dot)org> writes:
> On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI <
> horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> A PostgreSQL database with encoding=UTF8 just accepts the whole
>> range of Unicode, regardless that a character is defined for the
>> code or not.
> Interesting... when I converted my application and database to utf8
> encoding, I discovered that Postgres is picky about UTF-8. Specifically the
> UTF-8 code point 0xed 0xa0 0x8d which maps to UNICODE code point 0xd80d.
> This looks like a proper character but in fact is not a defined character
> code point.
Well, we're picky to the extent that RFC 3629 tells us to be picky:
http://www.faqs.org/rfcs/rfc3629.html
The case you mention is rejected because it would be half of a UTF16
"surrogate pair", which should not be used in any Unicode representation
other than UTF16; if we allowed it then there would be more than one way
to represent the same Unicode code point, which is undesirable for a lot
of reasons.
> So I think when you present an actual string of UTF8 encoded characters,
> Postgres does refuse characters unknown. However, as you observe, inserting
> the unicode code point directly does not produce an error:
> insert into unicode(id, string) values(1, U&'\d80d');
> INSERT 0 1
Hm. I think that's a bug. The lexer does know that \d80d is half of a
surrogate pair, and it expects the second half to come next. If you
supply something that isn't the second half of a surrogate pair, you
get an error as expected:
u8=# insert into unicode(id, string) values(1, U&'\d80dfoo');
ERROR: invalid Unicode surrogate pair at or near "foo'"
LINE 1: insert into unicode(id, string) values(1, U&'\d80dfoo');
^
But it looks like if you just end the string after the first half of a
surrogate, it just drops the character without complaint. Notice that
what got inserted was a zero-length string, not U+D08D:
u8=# select *, length(string) from unicode;
id | string | length
----+--------+--------
1 | | 0
(1 row)
I'd have expected a syntax error along the line of "incomplete Unicode
surrogate pair". Peter, I think this was your code to begin with ---
was it intentional to not raise error here, or is that an oversight?
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Adrian Klaver | 2016-12-21 16:58:58 | Re: error updating a tuple after promoting a standby |
Previous Message | Tom DalPozzo | 2016-12-21 16:17:03 | Re: error updating a tuple after promoting a standby |