From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Terry Laurenzo <tj(at)laurenzo(dot)org>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date: 2010-10-24 12:12:02
Message-ID: AANLkTimV=t-xZrXs+cGK=c7QGq3XoYo53CKT7jnJB_m2@mail.gmail.com
Lists: pgsql-hackers
On Sun, Oct 24, 2010 at 2:21 AM, Terry Laurenzo <tj(at)laurenzo(dot)org> wrote:
>> It doesn't do particularly well on my previous example of [1,2,3]. It
>> comes out slightly shorter on ["a","b","c"] and better if the strings
>> need any escaping. I don't think the float4 and float8 formats are
>> very useful; how could you be sure that the output was going to look
>> the same as the input? Or alternatively that dumping a particular
>> object to text and reloading it will produce the same internal
>> representation? I think it would be simpler to represent integers
>> using a string of digits; that way you can be assured of going from
>> text -> binary -> text without change.
>>
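To make that round-trip concern concrete, here is a tiny, self-contained C illustration (nothing to do with any actual patch code): the text you get back from a stored binary float depends entirely on how many digits you ask the formatter for, whereas a stored string of digits comes back byte-for-byte.

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        const char *in = "0.1";         /* the token as it appeared in the JSON text */
        float       f = strtof(in, NULL);   /* hypothetical float4 storage */
        double      d = strtod(in, NULL);   /* hypothetical float8 storage */

        /* Reprinting the stored binary value does not necessarily reproduce
         * the original token; the output depends on the precision requested. */
        printf("float4: %.9g\n", f);    /* 0.100000001 */
        printf("float8: %.17g\n", d);   /* 0.10000000000000001 */
        return 0;
    }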
>> Perhaps it would be enough to define the high two bits as follows: 00
>> = array, 01 = object, 10 = string, 11 = number/true/false/null. The
>> next 2 bits specify how the length is stored. 00 = remaining 4 bits
>> store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional
>> byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2
>> additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional
>> bytes store a full 32-bit length word. Then, the array, object, and
>> string representations can work as you've specified them. Anything
>> else can be represented by itself, or perhaps we should say that
>> numbers represent themselves and true/false/null are represented by a
>> 1-byte sequence, t/f/n (or perhaps we could define 111100{00,01,10} to
>> mean those values, since there's no obvious reason for the low bits to
>> be non-zero if a 4-bit length word ostensibly follows).
>>
>> So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81 'b'
>> 81 'c'
>>
>> (I am still worried about the serialization/deserialization overhead
>> but that's a different issue.)
>>
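For concreteness, here is a rough C sketch of what a header-byte encoder for the scheme above could look like; the enum values, the function name, and the big-endian order of the extra length bytes are all assumptions made for illustration, not anything taken from a patch.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative type codes for the high two bits of the header byte. */
    enum
    {
        JB_ARRAY  = 0x0,            /* 00 */
        JB_OBJECT = 0x1,            /* 01 */
        JB_STRING = 0x2,            /* 10 */
        JB_SCALAR = 0x3             /* 11: number/true/false/null */
    };

    /*
     * Write a header for a value of the given type and payload length and
     * return the number of header bytes emitted (1, 2, 3, or 5).  Bits 5-4
     * pick the length mode: 00 = 4-bit, 01 = 12-bit, 10 = 20-bit,
     * 11 = full 32-bit length in the following four bytes.
     */
    static int
    write_header(uint8_t *out, int type, uint32_t len)
    {
        if (len <= 0xF)
        {
            out[0] = (type << 6) | (0x0 << 4) | len;
            return 1;
        }
        else if (len <= 0xFFF)
        {
            out[0] = (type << 6) | (0x1 << 4) | (len >> 8);
            out[1] = len & 0xFF;
            return 2;
        }
        else if (len <= 0xFFFFF)
        {
            out[0] = (type << 6) | (0x2 << 4) | (len >> 16);
            out[1] = (len >> 8) & 0xFF;
            out[2] = len & 0xFF;
            return 3;
        }
        else
        {
            out[0] = (type << 6) | (0x3 << 4);
            out[1] = (len >> 24) & 0xFF;
            out[2] = (len >> 16) & 0xFF;
            out[3] = (len >> 8) & 0xFF;
            out[4] = len & 0xFF;
            return 5;
        }
    }

    int
    main(void)
    {
        uint8_t     buf[16];
        int         n = 0;
        int         i;

        /* Encode [1,2,3]: the three 1-byte numbers total 6 payload bytes. */
        n += write_header(buf + n, JB_ARRAY, 6);
        n += write_header(buf + n, JB_SCALAR, 1);
        buf[n++] = '1';
        n += write_header(buf + n, JB_SCALAR, 1);
        buf[n++] = '2';
        n += write_header(buf + n, JB_SCALAR, 1);
        buf[n++] = '3';

        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);
        printf("\n");               /* 06 C1 31 C1 32 C1 33 */
        return 0;
    }

With that layout every value of 15 bytes or less pays exactly one header byte, which is where the 06 C1 '1' C1 '2' C1 '3' example comes from.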
>
> Thanks. I'll play around with the bit and numeric encodings you've
> recommended. Arrays are certainly the toughest to economize on, as a text
> encoding has a minimum of 2 + (n-1) overhead bytes. Text encoding for objects
> has some more wiggle room with 2 + (n-1) + (n*3) extra bytes. I was
> admittedly thinking about more complicated objects.
> I'm still worried about transcoding overhead as well. If comparing to a
> simple blind storage of JSON text with no validation or normalization, there
> is obviously no way to beat a straight copy. However, it's not outside the
> realm of reason to think that it may be possible to match or beat the clock
> when comparing against text-to-text normalization, or perhaps adding slight
> overhead to a validator. The advantage to the binary structure is the
> ability to scan it hierarchically in sibling order and provide mutation
> operations with simple memcpy's as opposed to parse_to_ast -> modify_ast ->
> serialize_ast.
Yeah, my concern is not whether the overhead will be zero; it's
whether it will be small, yet allow large gains on other operations.
Like, how much slower will it be to pull a moderately complex 1MB
JSON blob (not just a big string) out of a single-row, single-column
table? If it's 5% slower, that's probably OK, since this is a
reasonable approximation of a worst-case scenario. If it's 50%
slower, that sounds painful. It would also be worth testing with a
much smaller size, such as a 1K object with lots of internal
structure. In both cases, all data cached in shared_buffers, etc.
Then, on the flip side, how do we do on val[37]["whatever"]? You'd like
to hope that this will be significantly faster than the text encoding
on both large and small objects. If it's not, there's probably not
much point.
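As a companion to the encoder sketch above, here is an equally hypothetical decoder showing the sibling-order scan Terry describes (same assumed header layout, purely illustrative): because every value carries its total payload length, the reader hops from sibling to sibling without ever descending into children, which is also the operation that would have to be cheap for something like val[37] to beat reparsing the text.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /*
     * Decode the header at p: set *type (0 = array, 1 = object, 2 = string,
     * 3 = scalar) and *len (payload length), and return the header size.
     */
    static size_t
    read_header(const uint8_t *p, int *type, uint32_t *len)
    {
        *type = p[0] >> 6;
        switch ((p[0] >> 4) & 0x3)
        {
            case 0:
                *len = p[0] & 0xF;
                return 1;
            case 1:
                *len = ((uint32_t) (p[0] & 0xF) << 8) | p[1];
                return 2;
            case 2:
                *len = ((uint32_t) (p[0] & 0xF) << 16) |
                       ((uint32_t) p[1] << 8) | p[2];
                return 3;
            default:
                *len = ((uint32_t) p[1] << 24) | ((uint32_t) p[2] << 16) |
                       ((uint32_t) p[3] << 8) | p[4];
                return 5;
        }
    }

    /* Skip one complete value (header plus payload) in a single step. */
    static const uint8_t *
    skip_value(const uint8_t *p)
    {
        int         type;
        uint32_t    len;

        return p + read_header(p, &type, &len) + len;
    }

    /*
     * Return a pointer to the i'th element of the array at 'arr', or NULL
     * if the index is out of range.  Siblings are skipped, never parsed.
     */
    static const uint8_t *
    array_element(const uint8_t *arr, int i)
    {
        int         type;
        uint32_t    arrlen;
        const uint8_t *p = arr + read_header(arr, &type, &arrlen);
        const uint8_t *end = p + arrlen;

        while (p < end)
        {
            if (i-- == 0)
                return p;
            p = skip_value(p);
        }
        return NULL;
    }

    int
    main(void)
    {
        /* [1,2,3] encoded as discussed upthread: 06 C1 '1' C1 '2' C1 '3' */
        const uint8_t arr[] = {0x06, 0xC1, '1', 0xC1, '2', 0xC1, '3'};
        const uint8_t *el = array_element(arr, 2);

        if (el != NULL)
            printf("element 2 is '%c'\n", el[1]);   /* element 2 is '3' */
        return 0;
    }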
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company