From: | Terry Laurenzo <tj(at)laurenzo(dot)org> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP) |
Date: | 2010-10-24 06:21:27 |
Message-ID: | AANLkTinCTALsqJTDkQprBHzYztD5k+KaFnOq4pYXf9W_@mail.gmail.com |
Lists: | pgsql-hackers |
>
> It doesn't do particularly well on my previous example of [1,2,3]. It
> comes out slightly shorter on ["a","b","c"] and better if the strings
> need any escaping. I don't think the float4 and float8 formats are
> very useful; how could you be sure that the output was going to look
> the same as the input? Or alternatively that dumping a particular
> object to text and reloading it will produce the same internal
> representation? I think it would be simpler to represent integers
> using a string of digits; that way you can be assured of going from
> text -> binary -> text without change.
>
> Perhaps it would be enough to define the high two bits as follows: 00
> = array, 01 = object, 10 = string, 11 = number/true/false/null. The
> next 2 bits specify how the length is stored. 00 = remaining 4 bits
> store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional
> byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2
> additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional
> bytes store a full 32-bit length word. Then, the array, object, and
> string representations can work as you've specified them. Anything
> else can be represented by itself, or perhaps we should say that
> numbers represent themselves and true/false/null are represented by a
> 1-byte sequence, t/f/n (or perhaps we could define 111100{00,01,10} to
> mean those values, since there's no obvious reason for the low bits to
> be non-zero if a 4-bit length word ostensibly follows).
>
> So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81 'b'
> 81 'c'
>
> (I am still worried about the serialization/deserialization overhead
> but that's a different issue.)
>
>
Thanks. I'll play around with the bit and numeric encodings you've
recommended. Arrays are certainly the toughest to economize on, as a text
encoding has a minimum of 2 + (n - 1) overhead bytes for n elements (two
brackets plus the separating commas). Text encoding for objects has some
more wiggle room, with 2 + (n - 1) + (n * 3) extra bytes (braces, commas,
and two key quotes plus a colon per member). I was admittedly thinking
about more complicated objects.

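
For reference, here is a minimal Python sketch of the tag-byte layout proposed in the quoted text, next to the text-overhead formulas above. Several details are assumptions, not something settled in this thread: the extra length bytes are big-endian with the tag's low 4 bits as the most significant bits, strings are stored as raw UTF-8, true/false/null use the 1-byte t/f/n option, and only integers are handled for numbers. Objects are left out because their body layout is not spelled out in this message. The names `header`, `encode`, `array_text_overhead`, and `object_text_overhead` are hypothetical.

```python
# Sketch only: byte order and string encoding are assumptions, not from the thread.
ARRAY, OBJECT, STRING, SCALAR = 0b00, 0b01, 0b10, 0b11  # high two bits of the tag


def header(kind, length):
    """Build the tag byte plus any additional length bytes."""
    if length < (1 << 4):    # form 00: length held in the low 4 bits
        return bytes([(kind << 6) | length])
    if length < (1 << 12):   # form 01: low 4 bits + 1 byte (12-bit length)
        return bytes([(kind << 6) | (0b01 << 4) | (length >> 8), length & 0xFF])
    if length < (1 << 20):   # form 10: low 4 bits + 2 bytes (20-bit length)
        tag = (kind << 6) | (0b10 << 4) | (length >> 16)
        return bytes([tag]) + (length & 0xFFFF).to_bytes(2, "big")
    # form 11: a full 32-bit length word follows the tag
    return bytes([(kind << 6) | (0b11 << 4)]) + length.to_bytes(4, "big")


def encode(value):
    """Encode lists, strings, ints, and true/false/null (no objects, no floats)."""
    if isinstance(value, list):
        body = b"".join(encode(v) for v in value)
        return header(ARRAY, len(body)) + body
    if isinstance(value, str):
        body = value.encode("utf-8")       # stored unescaped (assumption)
        return header(STRING, len(body)) + body
    if value is True:
        body = b"t"
    elif value is False:
        body = b"f"
    elif value is None:
        body = b"n"
    elif isinstance(value, int):
        body = str(value).encode("ascii")  # numbers as a string of digits
    else:
        raise TypeError("unsupported value in this sketch: %r" % (value,))
    return header(SCALAR, len(body)) + body


def array_text_overhead(n):
    """Brackets plus commas for an n-element JSON array in compact text form."""
    return 2 + (n - 1)


def object_text_overhead(n):
    """Braces, commas, and two key quotes plus a colon per member."""
    return 2 + (n - 1) + (n * 3)


print(encode([1, 2, 3]).hex())        # 06c131c132c133
print(encode(["a", "b", "c"]).hex())  # 06816181628163
```

Under these assumptions the two printed values match the byte sequences in the quoted example. [1,2,3] takes 7 bytes in either form, while ["a","b","c"] drops from 13 bytes of text to 7.
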
I'm still worried about transcoding overhead as well. Compared to simple
blind storage of JSON text with no validation or normalization, there is
obviously no way to beat a straight copy. However, it's not outside the
realm of reason to think that it may be possible to match or beat the clock
when comparing against text-to-text normalization, or perhaps to add only
slight overhead relative to a validator. The advantage of the binary
structure is the ability to scan it hierarchically in sibling order and
provide mutation operations with simple memcpy's, as opposed to
parse_to_ast -> modify_ast -> serialize_ast.

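
To illustrate the sibling-order scan and copy-based mutation, here is a rough Python sketch over the length-prefixed layout from the quoted proposal. It is not code from any patch in this thread. It assumes big-endian length bytes with the tag's low 4 bits as the most significant bits, and it only handles replacing a direct child of a top-level array. The names `read_header`, `header`, `children`, and `replace_child` are hypothetical, and byte slicing stands in for memcpy.

```python
def read_header(buf, pos):
    """Return (kind, body_start, body_len) for the value whose tag is at pos."""
    tag = buf[pos]
    kind, form, low = tag >> 6, (tag >> 4) & 0b11, tag & 0x0F
    if form == 0b11:                      # full 32-bit length word follows
        return kind, pos + 5, int.from_bytes(buf[pos + 1:pos + 5], "big")
    length = low                          # forms 00/01/10: 0, 1 or 2 extra bytes
    for b in buf[pos + 1:pos + 1 + form]:
        length = (length << 8) | b
    return kind, pos + 1 + form, length


def header(kind, length):
    """Inverse of read_header: tag byte plus any additional length bytes."""
    if length < (1 << 4):
        return bytes([(kind << 6) | length])
    if length < (1 << 12):
        return bytes([(kind << 6) | (0b01 << 4) | (length >> 8), length & 0xFF])
    if length < (1 << 20):
        tag = (kind << 6) | (0b10 << 4) | (length >> 16)
        return bytes([tag]) + (length & 0xFFFF).to_bytes(2, "big")
    return bytes([(kind << 6) | (0b11 << 4)]) + length.to_bytes(4, "big")


def children(buf, pos=0):
    """Yield (kind, start, end) per direct child, skipping nested contents."""
    _, body, length = read_header(buf, pos)
    cur, end = body, body + length
    while cur < end:
        ckind, cbody, clen = read_header(buf, cur)
        yield ckind, cur, cbody + clen    # start is the child's tag byte
        cur = cbody + clen                # hop straight to the next sibling


def replace_child(buf, index, new_child):
    """Swap one child of the top-level container using plain byte copies."""
    kind, body, length = read_header(buf, 0)
    _, start, end = list(children(buf))[index]
    new_body = buf[body:start] + new_child + buf[end:body + length]
    return header(kind, len(new_body)) + new_body  # outer length is rewritten


# [[1,2],"ab",true] in the sketched layout
doc = bytes.fromhex("0a04c131c132826162c174")
spans = list(children(doc))
updated = replace_child(doc, 1, bytes.fromhex("8378797a"))  # "ab" -> "xyz"
```

Note that the scan never looks inside the nested array, and the replacement touches no sibling bytes other than copying them. In a length-prefixed layout like this one, the enclosing length headers still have to be rewritten whenever the size changes, which this sketch does only for the outermost container.
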
Terry