Re: JSON and unicode surrogate pairs

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: JSON and unicode surrogate pairs
Date: 2013-06-10 17:01:29
Message-ID: 51B60669.3080107@dunslane.net
Lists: pgsql-hackers


On 06/10/2013 11:43 AM, Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Or we could abandon the conversion altogether, but that doesn't seem
>> very friendly either. I suspect the biggest case for people to use these
>> sequences is where the database is UTF8 but the client encoding is not.
> Well, if that's actually the biggest use-case, then maybe we should just
> say we're *not* in the business of converting those escapes. That would
> make things nice and consistent regardless of the DB encoding, and it
> would avoid the problem of being able to input a value and then not
> being able to output it again.
>
> It's legal, is it not, to just write the equivalent Unicode character in
> the JSON string and not use the escapes? If so I would think that that
> would be the most common usage. If someone's writing an escape, they
> probably had a reason for doing it that way, and might not appreciate
> our overriding their decision.
>
>

We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead
of JSON, and in those cases, when the value returned is a JSON string,
we do the following to it:

* strip the outside quotes
* de-escape the various escaped characters (i.e. everything preceded
by a backslash in the railroad diagram for string at
<http://www.json.org/>)

Here's an example of the difference:

andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
 ?column?
----------
 "\u00a9"
(1 row)

andrew=# select '{ "a": "\u00a9"}'::json ->> 'a';
 ?column?
----------
 ©
(1 row)

It's the process of producing the latter that is giving us a headache in
non-UTF8 databases.
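
To make that concrete, here's a sketch of the failing case (the LATIN1
database name and the Euro-sign value are my own illustration, not taken
from earlier in the thread). The json-returning operator is unaffected,
since it hands the escape back verbatim; the text-returning one has to
materialize a character the server encoding can't hold:

latin1db=# select '{ "a": "\u20ac"}'::json -> 'a';   -- returns json, escape kept
 ?column?
----------
 "\u20ac"
(1 row)

latin1db=# select '{ "a": "\u20ac"}'::json ->> 'a';  -- returns text, must de-escape
-- no sensible result is possible: the de-escaped value would have to
-- contain U+20AC, and LATIN1 has no such character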

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger
than that. We also use this value for field name comparison. So, let us
suppose that we have a LATIN1 database and a piece of JSON with a field
name containing the Euro sign ("\u20ac"), a character that is not in
LATIN1. Making that processable so it doesn't blow up would be mighty
tricky and error prone. The non-orthogonality I suggested as a solution
upthread is, by contrast, very small and easy to manage, and not
terribly hard to explain - see attached.
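
To spell out the "blow up" scenario (again a hypothetical LATIN1 session,
with key names of my own choosing): a lookup against such an object has
to compare its key against the de-escaped field names as it scans, so
even a query that never asks for the Euro field is in trouble:

latin1db=# select '{ "\u20ac": 1, "x": 2 }'::json -> 'x';
-- the comparison needs each field name in de-escaped form; de-escaping
-- "\u20ac" into LATIN1 is impossible, so even this apparently harmless
-- lookup hits the same conversion problem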

cheers

andrew

Attachment Content-Type Size
json-unicode-esc.patch text/x-patch 2.7 KB
