Re: working around JSONB's lack of stats?

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: working around JSONB's lack of stats?
Date: 2015-01-31 01:34:40
Message-ID: 54CC3130.5030604@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On 1/30/15 2:26 PM, Josh Berkus wrote:
> On 01/28/2015 03:50 PM, Peter Geoghegan wrote:
>> On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> jsonb_col @> '[ "key1" ]'
>>> or jsonb_col ? 'key1'
>>> if in MCE, assign % from MCE
>>> otherwise assign 1% of non-MCE %
>>>
>>> jsonb_col @> '{ "key1": "value1" }'
>>> if in MCE, assign MCE% * 0.1
>>> otherwise assign 0.01 of non-MCE %
>>>
>>> Does that make sense?
>>
>> I suspect it makes a lot less sense. The way people seem to want to
>> use jsonb is as a document store with a bit of flexibility. Individual
>> JSON documents tend to be fairly homogeneous in structure within a
>> table, just like with systems like MongoDB. Strings within arrays are
>> keys for our purposes, and these are often used for tags and so on.
>> But Strings that are the key of an object/pair are much less useful to
>> index, in my estimation.
>
> Yeah, I see your point; except for arrays, people are usually searching
> for a key:value pair, and the existence of the key is not in doubt.
>
> That would make the "element" the key:value pair, no? But
> realistically, we would only want to do that for simple keys and values.
>
> Although: if you "flatten" a nested JSON structure into just keys with
> scalar values (and array items as their own thing), then you could have
> a series of expanded key:value pairs to put into MCE.
>
> For example:
>
> { house : { city : San Francisco,
> sqft: 1200,
> color: blue,
> occupants: [ mom, dad, child1 ]
> }
> occupation: programmer
> }
>
> ... would get flattened out into the following pairs:
>
> city: san francisco
> sqft: 1200
> color: blue
> occupants: [ mom ]
> occupants: [ dad ]
> occupants: [ child1 ]
> occupation: programmer
>
> This would probably work because there aren't a lot of data structures
> where people would have the same key:value pair in different locations
> in the JSON, and care about it stats-wise. Alternatetly, if the same
> key-value pair appears multiple times in the same sample row, we could
> cut the MC% by that multiple.

Even if there were multiple occurrences, this would probably still be an
improvement.

Another idea... at one time in the past when discussing statistics on
multiple columns, one idea was to build statistics on indexes. If we
built that, we could also do the same thing for at least JSONB (not sure
about JSON). Obviously doesn't help for stuff you haven't indexed, but
presumably if you care about performance and have any significant size
of data you've also indexed parts of the JSON, yes?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jim Nasby 2015-01-31 01:40:13 Re: Query performance
Previous Message Jim Nasby 2015-01-31 01:28:26 Re: why pg_class.relfrozenxid needs to be updated for frozen tables (where all rows have xmin=2)?