Re: JSON and unicode surrogate pairs

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: JSON and unicode surrogate pairs
Date: 2013-06-10 15:20:13
Message-ID: 51B5EEAD.50208@dunslane.net
Lists: pgsql-hackers


On 06/10/2013 10:18 AM, Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> After thinking about this some more, I have come to the conclusion that
>> we should only de-escape \uxxxx sequences, whether or not they are for
>> BMP characters, when the server encoding is UTF8. For any other encoding
>> (already a violation of the JSON standard anyway, and best avoided if
>> you're dealing with JSON) we should just pass them through even in text
>> output. This will be a simple and very localized fix.
> Hmm. I'm not sure that users will like this definition --- it will seem
> pretty arbitrary to them that conversion of \u sequences happens in some
> databases and not others.

Then what should we do when there is no matching codepoint in the
database encoding? First we'll have to delay the evaluation so it's not
done over-eagerly, and then we'll have to try the conversion and throw
an error if it doesn't work. The second part is what's happening now,
but the delayed evaluation is not.
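To make the failure mode concrete, here is an illustrative Python sketch (not the server's actual C code): U+1D11E is representable in UTF-8 but has no counterpart in LATIN1, so an eager conversion of the escape has no choice but to error out.

```python
# Illustration only: converting a de-escaped code point into a
# non-Unicode "server encoding" fails when no matching codepoint exists.
def convert_or_error(codepoint: int, encoding: str) -> bytes:
    """Encode a single code point, raising if the target encoding
    has no representation for it (analogous to the server's error)."""
    return chr(codepoint).encode(encoding)

convert_or_error(0x1D11E, "utf-8")      # fine: b'\xf0\x9d\x84\x9e'
try:
    convert_or_error(0x1D11E, "latin-1")  # no matching codepoint
except UnicodeEncodeError:
    pass  # this is the error case under discussion
```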

Or we could abandon the conversion altogether, but that doesn't seem
very friendly either. I suspect the biggest case for people to use these
sequences is where the database is UTF8 but the client encoding is not.

Frankly, if you want to use Unicode escapes, you should really be using
a UTF8 encoded database if at all possible.
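For reference, the surrogate-pair arithmetic the de-escaping has to perform is simple; a minimal Python sketch (illustrative only, not the server's implementation):

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair (high: D800-DBFF, low: DC00-DFFF)
    into a single Unicode code point, per the UTF-16 encoding scheme."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The JSON escape sequence \uD834\uDD1E denotes U+1D11E
# (MUSICAL SYMBOL G CLEF), which is a 4-byte sequence in UTF-8:
cp = combine_surrogates(0xD834, 0xDD1E)
assert cp == 0x1D11E
```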

>
>> We'll still have to deal with this issue when we get to binary storage
>> of JSON, but that's not something we need to confront today.
> Well, if we have to break backwards compatibility when we try to do
> binary storage, we're not going to be happy either. So I think we'd
> better have a plan in mind for what will happen then.
>
>

I don't see any reason why we couldn't store the JSON strings with the
Unicode escape sequences intact in the binary format. What the binary
format buys us is that it has decomposed the JSON into a tree structure,
so instead of parsing the JSON we can just walk the tree, but the leaf
nodes of the tree are still (in the case of the nodes under discussion)
text-like objects.

cheers

andrew
