Re: Unicode string literals versus the world

From: Marko Kreen <markokr(at)gmail(dot)com>
To: Sam Mason <sam(at)samason(dot)me(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode string literals versus the world
Date: 2009-04-16 15:34:06
Message-ID: e51f66da0904160834m5d5bed5aj510b679230be0f7b@mail.gmail.com
Lists: pgsql-hackers

On 4/16/09, Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:
> On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
> > On 4/16/09, Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:
> > > Microsoft have also gone this way in C#, named code points are not
> > > supported however.
> >
> > And it handles also non-BMP codepoints with \u escape similarly:
> >
> > http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
> >
> > This makes it even more standard.
>
>
> I fail to see what you're pointing out here; as far as I understand it,
> \u is for BMP code points and \U extends the range out to 32bit code
> points. I can't see anything about non-BMP and \u in the above link,
> you appear free to write your own surrogate pairs but that seems like an
> independent issue.

Ok, maybe I glanced too quickly over that page.

I can't find a definitive reference, only a hint on several pages:

    \U    \Unnnnnnnn    Unicode escape sequence for surrogate pairs.

This hints that you can just as well enter the pairs directly: \uxx\uxx.
If I were the language designer, I would not see any reason to disallow it.

And anyway, at least mono seems to support it:

    using System;

    public class HelloWorld {
        public static void Main() {
            // A surrogate pair written as two \u escapes (U+10302)
            Console.WriteLine("<\uD800\uDF02>\n");
        }
    }

It outputs a single UTF-8 character. I think this should settle it.
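
To make the arithmetic behind that pair concrete, here is a minimal sketch in modern Python 3 (not the Python 2 build discussed in this thread). The helper name combine_surrogates is hypothetical; it applies the standard UTF-16 rule for joining a high and a low surrogate into one code point, then encodes the result as UTF-8.

```python
def combine_surrogates(high, low):
    # Join a UTF-16 surrogate pair into a single code point.
    # Hypothetical helper for illustration only.
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high surrogate: %#06x" % high)
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
utf8_bytes = chr(code_point).encode("utf-8")

print(hex(code_point), code_point)  # 0x10302 66306
print(utf8_bytes)                   # b'\xf0\x90\x8c\x82'
```

So the pair <D800 DF02> is one character, U+10302, which takes four bytes in UTF-8.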

> I'd not realised before that C# is specified to use UTF-16 as its
> internal encoding.
>
> > > This would be following the BitC[2] project, especially if it was more
> > > like:
> > >
> > > \{U+xxxx}
> >
>
> > We already got yet-another-unique-way-of-escaping-unicode with U&.
> >
> > Now let's try to support some actual standard also.
>
>
> That comes across *very* negatively; I hope it's just a language issue.
>
> I read your parent post as soliciting opinions on possible ways to
> encode Unicode characters in PG's literals. The U&'lit' was criticised,
> you posted some suggestions, I followed up with what I hoped to be a
> useful addition. It seems useful here to separate "de jure" from "de
> facto" standards; implementing U&'lit' would be following the de jure
> standard, anything else would be de facto.
>
> A survey of existing SQL implementations would seem to be more appropriate
> as well:
>
> Oracle: UNISTR(string-literal) and \xxxx
>
> It looks as though Oracle originally used UCS-2 internally (i.e. BMP
> only) but more recently Unicode support has been improved to allow
> other planes.
>
> MS-SQL Server:
>
> can't find anything remotely useful; best seems to be to use
> NCHAR(integer-expression) which looks somewhat unmaintainable.
>
> DB2: U&string-literal and \xxxxxx
>
> i.e. it follows the SQL-2003 spec
>
> FireBird:
>
> can't find much either; support looks somewhat low on the ground
>
> MySQL:
>
> same again, seems to assume query is encoded in UTF-8
>
> Summary seems to be that either I'm bad at searching or support for
> Unicode doesn't seem very complete in the database world and people work
> around it somehow.

The de facto standard in Postgres is stdstr=off (that is,
standard_conforming_strings = off). Even if it were not, E'' strings
are still better for various things, so it would be good if they also
acquired Unicode capabilities.

> > You did not read my mail carefully enough - the Java and also Python/C#
> > already support non-BMP chars with '\u' and exactly the same (utf16) way.
>
>
> Again, I think this may be a language issue; if not then more verbose
> explanations help, maybe something like "sorry, I obviously didn't
> explain that very well". You will of course felt you explained it
> perfectly well, but everybody enters a discussion with different
> intuitions and biases, email has a nasty habit of accentuating these
> differences and compounding them with language problems.
>
> I'd never heard of UTF-16 surrogate pairs before this discussion and
> hence didn't realise that it's valid to have a surrogate pair in place
> of a single code point. The docs say that <D800 DF02> corresponds to
> U+10302, Python would appear to follow my intuitions in that:
>
> ord(u'\uD800\uDF02')
>
> results in an error instead of giving back 66306, as I'd expect. Is
> this a bug in Python, my understanding, or something else?

Python's internal representation is *not* UTF-16 but plain UCS2 or UCS4,
that is, plain 16-bit or 32-bit values. It seems your Python is compiled
with UCS2, not UCS4. As I understand it, in UCS2 mode it simply takes
surrogate values as-is. From the ord() docs:

    If a unicode argument is given and Python was built with UCS2 Unicode,
    then the character’s code point must be in the range [0..65535]
    inclusive; otherwise the string length is two, and a TypeError will
    be raised.

So only in UCS4 mode does it detect surrogates and convert them to the
internal representation (which in Postgres' case would be UTF8).
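
As a rough illustration of what "detect surrogates and convert" could look like for an escape-string parser, here is a minimal sketch in modern Python 3. Everything in it is an assumption for illustration: unescape_u and ESCAPE_RE are hypothetical names, only \uXXXX escapes are handled (no other backslash escapes), and rejecting unpaired surrogates is one possible policy, not something decided in this thread.

```python
import re

# Matches one \uXXXX escape (exactly four hex digits).
ESCAPE_RE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_u(text):
    # Expand \uXXXX escapes, joining surrogate pairs into one code point.
    # Hypothetical helper; other backslash escapes are left untouched.
    out = []
    pos = 0
    while pos < len(text):
        m = ESCAPE_RE.match(text, pos)
        if m is None:
            out.append(text[pos])
            pos += 1
            continue
        value = int(m.group(1), 16)
        pos = m.end()
        if 0xD800 <= value <= 0xDBFF:
            # High surrogate: the next escape must be a low surrogate.
            m2 = ESCAPE_RE.match(text, pos)
            low = int(m2.group(1), 16) if m2 else None
            if low is None or not (0xDC00 <= low <= 0xDFFF):
                raise ValueError("high surrogate without low surrogate")
            value = 0x10000 + ((value - 0xD800) << 10) + (low - 0xDC00)
            pos = m2.end()
        elif 0xDC00 <= value <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out.append(chr(value))
    return "".join(out)


decoded = unescape_u(r"<\uD800\uDF02>")
encoded = decoded.encode("utf-8")
print(len(decoded))  # 3: '<', U+10302, '>'
print(encoded)       # b'<\xf0\x90\x8c\x82>'
```

The point of the sketch is only that the pair is merged before encoding, so the stored UTF-8 contains one four-byte character rather than two separately encoded surrogates.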

Or perhaps it is partially UTF-16 aware, e.g. the I/O routines understand
UTF-16 but the low-level string routines do not:

    print "<%s>" % u'\uD800\uDF02'

seems to handle it properly.

--
marko
