From: | Marko Kreen <markokr(at)gmail(dot)com> |
---|---|
To: | Sam Mason <sam(at)samason(dot)me(dot)uk> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Unicode string literals versus the world |
Date: | 2009-04-16 11:47:20 |
Message-ID: | e51f66da0904160447m764b0ee9i925b45b00320d084@mail.gmail.com |
Lists: | pgsql-hackers |
On 4/16/09, Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:
> On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
> > On 4/15/09, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > > Given Martijn's complaint about more-than-16-bit code points, I think
> > > the \u proposal is not mature enough to go into 8.4. We can think
> > > about some version of that later, if there's enough interest.
> >
> > I think it would be a good idea. Basically we should pick one of a
> > couple of pre-existing sane schemes. Here is a quick summary
> > of Python, Perl and Java:
> >
> > Python [1]:
> >
> > \uXXXX - 16-bit codepoint
> > \UXXXXXXXX - 32-bit codepoint
> > \N{char-name} - Character by name
>
>
> Microsoft have also gone this way in C#, named code points are not
> supported however.
It also handles non-BMP code points with the \u escape in the same way:
http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
This makes it even more of a standard.
> > Perl [2]:
> >
> > \x{XXXX..} - {} contains hexadecimal codepoint
> > \N{char-name} - Unicode char name
>
>
> Looks OK, but the 'x' seems somewhat redundant. Why not just:
>
> \{xxxx}
>
> This would be following the BitC[2] project, especially if it was more
> like:
>
> \{U+xxxx}
>
> e.g.
>
> \{U+03BB}
>
> would be the lowercase lambda character. Added appeal is in the fact
> that this (i.e. U+03BB) is how the Unicode consortium spells code
> points.
We already have yet another unique way of escaping Unicode with U&.
Now let's try to support an actual standard as well.
> > Java [3]:
> >
> > \uXXXX - 16-bit codepoint
>
>
> AFAIK, Java isn't the best reference to choose; it assumed from an early
> point in its design that Unicode characters were at most 16bits and
> hence had to switch its internal representation to UTF-16. I don't
> program much Java these days to know how it's all worked out, but it
> would be interesting to hear from people who regularly have to deal with
> characters outside the BMP (i.e. code points greater than 65535).
You did not read my mail carefully enough: Java, and also Python and C#,
already support non-BMP chars with '\u', all in exactly the same way,
as UTF-16 surrogate pairs.
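
For reference, here is a minimal sketch of what "the UTF-16 way" means for
a \uXXXX escape: a code point above U+FFFF is written as two escapes, a high
surrogate followed by a low surrogate. The function names (to_surrogates,
from_surrogates, decode_u_escapes) are made up for illustration and are not
taken from any of the languages discussed or from PostgreSQL; the arithmetic
is the standard UTF-16 surrogate encoding.

```python
import re


def to_surrogates(cp):
    """Split a non-BMP code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000                      # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(hi, lo):
    """Combine a (high, low) UTF-16 surrogate pair into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


# A high+low surrogate pair is tried first; otherwise any single \uXXXX.
_ESCAPE = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
    r"|\\u([0-9a-fA-F]{4})"
)


def decode_u_escapes(text):
    """Replace \\uXXXX escapes in text, joining surrogate pairs.

    Hypothetical helper: lone surrogates are passed through unchanged
    rather than rejected, which a real implementation may not want.
    """
    def repl(m):
        if m.group(1) is not None:
            return chr(from_surrogates(int(m.group(1), 16), int(m.group(2), 16)))
        return chr(int(m.group(3), 16))

    return _ESCAPE.sub(repl, text)
```

With this, U+1D11E (the musical G clef) is written as \uD834\uDD1E, while a
BMP character such as U+03BB stays a single \u03BB.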
--
marko