Re: json indexing and data types

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Kaare Rasmussen <kaare(at)jasonic(dot)dk>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: json indexing and data types
Date: 2015-12-03 00:04:13
Message-ID: 565F86FD.2020909@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 12/2/15 12:03 AM, Kaare Rasmussen wrote:
>
> The hard part is that some of the data items really have another type.
> There are dates and floating points, as the most important ones. And the
> really hard part is that sorting and range searches are important,
> especially for these two types. Having dates is iso-format, and
> left-padding floats with zeros is a low tech solution, and especially
> the latter is not very efficient.
>
> The solution might be to add functional indexes for these data items,
> but it's cumbersome and not easily maintainable. If a one-stop solution
> is in the works, or already there, it could save a lot of time.

We have a client that has a similar (though also a bit different) need.
Specifically, they get an XML document that has element attributes that
tell you what data type the element should contain. We convert the XML
to JSON (easy thanks to plpython), which produces a bunch of nested JSON
objects (typed as specifically as possible in JSON). The XML attributes
get turned into items in an object. So

<some-element type="integer">42</some-element>

becomes something like

"some-element": { "@type": "integer", "#text": 42 }

Some transforms are applied to that (like replacing - with _), and the
resulting JSON is used to create a set of tables, where each table
contains one level of nesting (triggers to handle inserts are also
created). Finally, views that break out each element value are created
on top of these tables. If specific type info is available that's used
to determine the type of the column, otherwise an appropriate type is
chosen based on json_typeof(). This results in a view that looks
something like

SELECT ((json_data-> 'some_element')->>'#text')::integer AS some_element

The reason we went with one table per level was to allow full indexing
(without needing Vodka), because it made the code easier to develop (at
least during initial conception), and because it supports joining
between the views nicely (something we needed). You could probably do
this without splitting into multiple tables, but I suspect it would add
significant complexity to the view creation.

I'd like to eventually open source the guts of this, but unfortunately
there's a lot of work required to get it to the point where that would
be possible. There's also some choices that were made that in retrospect
should probably be done differently. Of course if someone wanted to pay
us to do that then we'll find the time ;). Short of that if someone is
really serious about helping with that effort I can start untangling
parts of this from the proprietary codebase that it's currently buried
in, but even that would be a pretty significant effort.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Sheena, Prabhjot 2015-12-03 00:33:25 9.4 upgrade Help using pg_upgrade
Previous Message Jim Nasby 2015-12-02 23:25:37 Re: Pgbasebackup help