Re: UTF16 surrogate pairs in UTF8 encoding

From:	Florian Weimer <fweimer(at)bfk(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: UTF16 surrogate pairs in UTF8 encoding
Date:	2010-08-23 06:50:35
Message-ID:	82vd71kekk.fsf@mid.bfk.de
Lists:	pgsql-hackers

* Tom Lane:

> I just noticed that we are now advertising the ability to insert UTF16
> surrogate pairs in strings and identifiers (see section 4.1.2.2 in
> current docs, in particular). Is this really wise? I thought that
> surrogate pairs were specifically prohibited in UTF8 strings, because
> of the security hazards implicit in having more than one way to
> represent the same code point.

There is relatively little risk because surrogate pairs cannot encode
characters in the BMP, and presumably, most of the critical characters
are located there.
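
To illustrate why a surrogate pair can never alias a BMP character, here is a minimal sketch of the standard UTF-16 pair-decoding arithmetic. The function name `decode_surrogate_pair` is made up for this example and is not taken from PostgreSQL.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000,
    # so it always lands above the BMP (U+0000..U+FFFF).
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Smallest and largest code points reachable through a surrogate pair.
lowest = decode_surrogate_pair(0xD800, 0xDC00)
highest = decode_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(lowest), hex(highest))
```

Because of the 0x10000 offset, the pair mechanism only reaches U+10000 through U+10FFFF, so it cannot produce a second spelling of ASCII or any other BMP character.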

However, if the pair is converted to regular UTF-8, I really question
the sense of accepting it at all. Usually, people want CESU-8 so that
sort order stays consistent between languages such as C# and Java
(which compare strings by UTF-16 code unit) and their database, and
conversion destroys this property.
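
The ordering point can be sketched as follows. This is an illustration only: `cesu8_encode` is a hypothetical helper written for this example (Python has no built-in CESU-8 codec), and the two sample characters are arbitrary choices.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: supplementary characters are written as two
    3-byte sequences, one per UTF-16 surrogate, instead of one 4-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8", "surrogatepass")
        else:
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = "\uFF5E"         # a BMP character above the surrogate range
astral = "\U0001F600"  # a supplementary character (needs a surrogate pair in UTF-16)

# Big-endian UTF-16 bytes compare the same way as UTF-16 code units.
utf16_says_bmp_first = bmp.encode("utf-16-be") < astral.encode("utf-16-be")
cesu8_says_bmp_first = cesu8_encode(bmp) < cesu8_encode(astral)
utf8_says_bmp_first = bmp.encode("utf-8") < astral.encode("utf-8")
print(utf16_says_bmp_first, cesu8_says_bmp_first, utf8_says_bmp_first)
```

For this pair, UTF-16 code unit order and CESU-8 byte order agree with each other, while regular UTF-8 byte order (which follows code point order) gives the opposite answer.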

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

In response to

UTF16 surrogate pairs in UTF8 encoding at 2010-08-22 18:29:20 from Tom Lane
