Re: Unicode string literals versus the world

From: Marko Kreen <markokr(at)gmail(dot)com>
To: Sam Mason <sam(at)samason(dot)me(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unicode string literals versus the world
Date: 2009-04-16 11:47:20
Message-ID: e51f66da0904160447m764b0ee9i925b45b00320d084@mail.gmail.com
Lists: pgsql-hackers

On 4/16/09, Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:
> On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
> > On 4/15/09, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > > Given Martijn's complaint about more-than-16-bit code points, I think
> > > the \u proposal is not mature enough to go into 8.4. We can think
> > > about some version of that later, if there's enough interest.
> >
> > I think it would be a good idea. Basically we should pick one of
> > a couple of pre-existing sane schemes. Here is a quick summary
> > of Python, Perl and Java:
> >
> > Python [1]:
> >
> > \uXXXX - 16-bit codepoint
> > \UXXXXXXXX - 32-bit codepoint
> > \N{char-name} - Character by name
>
>
> Microsoft have also gone this way in C#; named code points are not
> supported, however.

And it also handles non-BMP code points with the \u escape in the same way:

http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences

This makes it even more standard.

> > Perl [2]:
> >
> > \x{XXXX..} - {} contains hexadecimal codepoint
> > \N{char-name} - Unicode char name
>
>
> Looks OK, but the 'x' seems somewhat redundant. Why not just:
>
> \{xxxx}
>
> This would be following the BitC[2] project, especially if it was more
> like:
>
> \{U+xxxx}
>
> e.g.
>
> \{U+03BB}
>
> would be the lowercase lambda character. Added appeal is in the fact
> that this (i.e. U+03BB) is how the Unicode consortium spells code
> points.

We already have yet another unique way of escaping Unicode with U&.

Now let's try to support an actual standard as well.

> > Java [3]:
> >
> > \uXXXX - 16-bit codepoint
>
>
> AFAIK, Java isn't the best reference to choose; it assumed from an early
> point in its design that Unicode characters were at most 16 bits and
> hence had to switch its internal representation to UTF-16. I don't
> program enough Java these days to know how it's all worked out, but it
> would be interesting to hear from people who regularly have to deal with
> characters outside the BMP (i.e. code points greater than 65535).

You did not read my mail carefully enough: Java, Python and C# all
already support non-BMP chars with '\u', in exactly the same (UTF-16) way,
i.e. as a surrogate pair written as two consecutive \uXXXX escapes.
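
To make "the UTF-16 way" concrete, here is a minimal sketch in Python of
how a code point maps to \uXXXX escapes. The function names are made up
for illustration and are not taken from any of the implementations
mentioned above; it only shows the standard UTF-16 surrogate arithmetic.

```python
def to_u_escapes(cp):
    # Hypothetical helper: spell code point `cp` with 16-bit \u escapes.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    if cp < 0x10000:
        # BMP code point: a single escape is enough.
        return "\\u%04X" % cp
    # Non-BMP: split the 20-bit offset into a high and a low surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return "\\u%04X\\u%04X" % (high, low)


def from_surrogate_pair(high, low):
    # Hypothetical helper: recombine a surrogate pair into one code point.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(to_u_escapes(0x03BB))    # lowercase lambda, inside the BMP
print(to_u_escapes(0x1D11E))   # musical G clef, outside the BMP
```

So U+03BB stays a single escape, while U+1D11E has to be written as
two escapes that a parser must recombine.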

--
marko
