json indexing and data types

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Kaare Rasmussen <kaare(at)jasonic(dot)dk>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: json indexing and data types
Date: 2015-12-03 01:06:34
Message-ID: CAHyXU0zHh8sMCeqDzk30O7BGd1b-hz8frKokfMvv_YFPz8NEPA@mail.gmail.com
Lists: pgsql-general

On Wed, Dec 2, 2015 at 12:03 AM, Kaare Rasmussen <kaare(at)jasonic(dot)dk> wrote:
> Hi
>
> As json essentially has only a few scalar types (string, number, boolean,
> and null), I wonder how much of this - to index, search, and sort on
> unstructured data - is possible. I guess part of the answer would be
> 'jsquery and vodka', but let me describe the problem first.
>
> The basic setup is that I have a column with what is essentially json
> data; a number of data structures of different depths. Perhaps 10 - 30 top
> levels, and probably no more than 3, max 4 levels deep. In total there are
> some hundred thousand rows in each table. It would probably be best stored
> as jsonb. Right now it's text, because it's only used by the application
> itself.
>
> It would be incredibly useful to add an index to this column, and to be
> able to search, using the index, on arbitrary elements. This part seems
> already there, with jsquery.
>
> The hard part is that some of the data items really have another type.
> There are dates and floating points, as the most important ones. And the
> really hard part is that sorting and range searches are important,
> especially for these two types. Storing dates in ISO format and
> left-padding floats with zeros is a low-tech solution, and especially the
> latter is not very efficient.
>
> The solution might be to add functional indexes for these data items, but
> it's cumbersome and not easily maintainable. If a one-stop solution is in
> the works, or already there, it could save a lot of time.
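The functional (expression) indexes mentioned above can be sketched roughly as follows. This is only an illustration, assuming a hypothetical table `docs` with a jsonb column `data` and a numeric field stored under the key `price`; none of these names come from the thread:

```sql
-- Expression index giving typed range searches on one specific field.
-- The text-to-numeric cast is immutable, so it can be indexed directly.
CREATE INDEX docs_price_idx ON docs (((data ->> 'price')::numeric));

-- A query with the matching expression can use the index for a range scan:
SELECT * FROM docs
 WHERE (data ->> 'price')::numeric BETWEEN 10.0 AND 99.9;
```

Note that the equivalent cast to timestamptz is only STABLE (it depends on session settings), so an index on a date field needs a small IMMUTABLE wrapper function around the cast. This is part of why a per-field approach becomes cumbersome to maintain.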

I feel your pain. jsquery is superb for subdocument searching on
*specific* subdocuments, but range searching is really limited. Value
searching works for numerics, but range searching on dates and text is not
supported. We also have to understand that you are asking the index to make
assumptions about the json that are not clear from the structure itself
(such as subfield 'x' being a date).
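For reference, the kind of numeric value search jsquery does support looks roughly like this (a sketch assuming the jsquery extension is installed and the hypothetical `docs`/`data` names from above; the path `details.price` is made up for illustration):

```sql
-- jsquery ships its own GIN operator classes for jsonb:
CREATE EXTENSION jsquery;
CREATE INDEX docs_data_idx ON docs USING gin (data jsonb_path_value_ops);

-- Indexed numeric value search on a specific subdocument path:
SELECT * FROM docs WHERE data @@ 'details.price > 100'::jsquery;
```

The same style of query against a date stored as a string, or a text range, is where the index stops helping.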

The only workaround I've been able to come up with is to migrate the json
to a specially encoded text field, stored side by side with the source
json, that is more amenable to pg_trgm based searching (to give you a taste
of that complexity, keys are stored upper case and values are stored lower
case).
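A minimal sketch of that side-by-side encoding might look like the following, again using the hypothetical `docs`/`data` names. This only flattens the top level of each document (nested keys would need a recursive function), and the upper/lower-case convention is just the one described above:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Side-by-side text rendering: keys upper-cased, values lower-cased,
-- so key matches and value matches are distinguishable to pg_trgm.
ALTER TABLE docs ADD COLUMN data_flat text;

UPDATE docs
   SET data_flat = (SELECT string_agg(upper(key) || ':' || lower(value), ' ')
                      FROM jsonb_each_text(docs.data) AS t(key, value));

CREATE INDEX docs_flat_trgm_idx ON docs USING gin (data_flat gin_trgm_ops);
```

Searches then go against `data_flat` with `LIKE`/`ILIKE` patterns, which the trigram index can accelerate.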

Some might say that you're better off using a dedicated json search
server like Solr, but these systems aren't magic; they will quickly boil
down to a brute force search in the face of complex queries, and they have
lots of other problems in my experience (starting with lack of proper
transactions and painfully slow insertion of large documents). Other
people recommend them; I don't.

One way of looking at this problem is that the "schemaless" check is
getting cashed. If you need detailed data-driven queries (as opposed to
more 'text searchy' type searches), perhaps it's time to start running your
data through a normalized structure.
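One way to normalize without giving up the flexible document shape is to extract the typed fields into a side table at write time. A rough sketch, with all names hypothetical:

```sql
-- Typed fields pulled out of the json at insert/update time,
-- one row per (document, path) pair:
CREATE TABLE doc_fields (
    doc_id   integer NOT NULL,
    path     text    NOT NULL,   -- e.g. 'details.price'
    val_text text,
    val_num  numeric,
    val_ts   timestamptz
);

-- Ordinary btree indexes now give real typed range scans and sorting:
CREATE INDEX doc_fields_num_idx ON doc_fields (path, val_num);
CREATE INDEX doc_fields_ts_idx  ON doc_fields (path, val_ts);
```

Range searches and sorts on dates and floats then join back to the source table, at the cost of keeping the side table in sync.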

merlin
