Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Terry Laurenzo <tj(at)laurenzo(dot)org>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date: 2010-10-24 04:57:05
Message-ID: AANLkTikn=YjzJ0vEQPOr19eQSypr=0JuPTVgGUKkZ1bv@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 23, 2010 at 8:33 PM, Terry Laurenzo <tj(at)laurenzo(dot)org> wrote:
>>
>> I'm still going to write up a proposed grammar that takes these items into
>> account - just ran out of time tonight.
>>
>
> The binary format I was thinking of is here:
>
>   http://github.com/tlaurenzo/pgjson/blob/master/pgjson/shared/include/json/jsonbinary.h
>
> This was just a quick brain dump and I haven't done a lot of diligence on
> verifying it, but I think it should be more compact than most JSON text
> payloads and quick to iterate over/update in sibling traversal order vs
> depth-first traversal which is what we would get out of JSON text.
> Thoughts?
> Terry

It doesn't do particularly well on my previous example of [1,2,3]. It
comes out slightly shorter on ["a","b","c"] and better if the strings
need any escaping. I don't think the float4 and float8 formats are
very useful; how could you be sure that the output was going to look
the same as the input? Or alternatively that dumping a particular
object to text and reloading it will produce the same internal
representation? I think it would be simpler to represent integers
using a string of digits; that way you can be assured of going from
text -> binary -> text without change.
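
To make the round-trip concern concrete, here is a minimal Python sketch. It is only an illustration, not code from the patch: the helper names are made up, and Python's float stands in for a float8 field. It compares storing a number as an IEEE 754 double with storing its original digits.

```python
import struct

def roundtrip_float8(text: str) -> str:
    """Hypothetical helper: parse JSON number text into a double,
    pack it into 8 bytes, unpack it, and print it back as text."""
    packed = struct.pack("<d", float(text))
    (value,) = struct.unpack("<d", packed)
    return repr(value)

def roundtrip_digits(text: str) -> str:
    """Hypothetical helper: store the number as its own characters."""
    packed = text.encode("ascii")
    return packed.decode("ascii")

if __name__ == "__main__":
    for original in ["1.10", "1e2", "12345678901234567890"]:
        print(original, "->", roundtrip_float8(original),
              "vs", roundtrip_digits(original))
```

With the float8 path, "1.10" comes back as "1.1" and "1e2" as "100.0", and a 20-digit integer loses its low digits. The digit-string path returns each input unchanged.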

Perhaps it would be enough to define the high two bits as follows: 00
= array, 01 = object, 10 = string, 11 = number/true/false/null. The
next 2 bits specify how the length is stored. 00 = remaining 4 bits
store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional
byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2
additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional
bytes store a full 32-bit length word. Then, the array, object, and
string representations can work as you've specified them. Anything
else can be represented by itself, or perhaps we should say that
numbers represent themselves and true/false/null are represented by a
1-byte sequence, t/f/n (or perhaps we could define 111100{00,01,10} to
mean those values, since there's no obvious reason for the low bits to
be non-zero if a 4-byte length word ostensibly follows).
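
The header layout described above can be sketched in Python as follows. The byte order is an assumption on my part: the proposal does not say whether the 4 bits in the first byte are the high or low part of a 12- or 20-bit length, so this sketch treats them as the most significant bits, followed by the extra bytes in big-endian order. The function and constant names are hypothetical.

```python
# Type tags stored in the high two bits of the first byte.
ARRAY, OBJECT, STRING, SCALAR = 0b00, 0b01, 0b10, 0b11

def encode_header(type_tag: int, length: int) -> bytes:
    """Return the 1-, 2-, 3-, or 5-byte header for a payload of `length` bytes."""
    if not 0 <= length <= 0xFFFFFFFF:
        raise ValueError("length does not fit in 32 bits")
    top = type_tag << 6
    if length < (1 << 4):      # mode 00: length in the low 4 bits
        return bytes([top | (0b00 << 4) | length])
    if length < (1 << 12):     # mode 01: 4 bits + 1 extra byte
        return bytes([top | (0b01 << 4) | (length >> 8), length & 0xFF])
    if length < (1 << 20):     # mode 10: 4 bits + 2 extra bytes
        return bytes([top | (0b10 << 4) | (length >> 16),
                      (length >> 8) & 0xFF, length & 0xFF])
    # mode 11: low 4 bits unused, full 32-bit length follows (assumed big-endian)
    return bytes([top | (0b11 << 4)]) + length.to_bytes(4, "big")

def decode_header(buf: bytes, pos: int = 0):
    """Return (type_tag, length, header_size) for the header at buf[pos]."""
    first = buf[pos]
    type_tag, mode, low = first >> 6, (first >> 4) & 0b11, first & 0x0F
    if mode == 0b11:
        return type_tag, int.from_bytes(buf[pos + 1:pos + 5], "big"), 5
    length = low
    for extra_byte in buf[pos + 1:pos + 1 + mode]:  # mode == number of extra bytes
        length = (length << 8) | extra_byte
    return type_tag, length, 1 + mode
```

This sketch does not implement the 111100{00,01,10} shortcut for true/false/null; a decoder that wanted it would have to special-case those three byte values before reading a 32-bit length.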

So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81 'b' 81 'c'
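
As a sanity check on those byte strings, here is a small self-contained Python encoder for the scheme. It is a sketch under assumptions: an array's length is taken to be the byte length of its encoded elements (which is what the 06 above implies), extra length bytes are assumed big-endian, true/false/null use the one-byte t/f/n variant, and objects are left out because their layout is defined in Terry's header rather than here. The names `header` and `encode` are hypothetical.

```python
ARRAY, OBJECT, STRING, SCALAR = 0b00, 0b01, 0b10, 0b11

def header(type_tag: int, length: int) -> bytes:
    """Type tag (2 bits), length mode (2 bits), then the length itself."""
    for mode, extra in ((0b00, 0), (0b01, 1), (0b10, 2)):
        if length < 1 << (4 + 8 * extra):
            raw = length.to_bytes(extra + 1, "big")  # raw[0] fits in 4 bits
            return bytes([(type_tag << 6) | (mode << 4) | raw[0]]) + raw[1:]
    return bytes([(type_tag << 6) | (0b11 << 4)]) + length.to_bytes(4, "big")

def encode(value) -> bytes:
    """Encode None, bool, int, str, or list; numbers are kept as digit strings."""
    if value is None:
        return header(SCALAR, 1) + b"n"
    if isinstance(value, bool):            # check bool before int
        return header(SCALAR, 1) + (b"t" if value else b"f")
    if isinstance(value, int):
        body = str(value).encode("ascii")
        return header(SCALAR, len(body)) + body
    if isinstance(value, str):
        body = value.encode("utf-8")       # stored unescaped
        return header(STRING, len(body)) + body
    if isinstance(value, list):
        body = b"".join(encode(item) for item in value)
        return header(ARRAY, len(body)) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")

if __name__ == "__main__":
    print(encode([1, 2, 3]).hex(" "))
    print(encode(["a", "b", "c"]).hex(" "))
```

Both arrays encode to 7 bytes. That ties the 7-byte text form of [1,2,3] and beats the 13-byte text form of ["a","b","c"].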

(I am still worried about the serialization/deserialization overhead
but that's a different issue.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
