Re: Unable to make use of "deep" JSONB index

From: Shaheed Haque <shaheedhaque(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Unable to make use of "deep" JSONB index
Date: 2022-06-13 15:59:06
Message-ID: CAHAc2jcu18rQ+_e+u4M1xpP462eZS+ufsBtueT18BQ8PM4byZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Jeff, David, Erik, Tom,

I now see that I misunderstood several aspects of the problem. With
regard to Jeff's last note, I note (and like!!!!):

- the use of the final ".id" to collapse the employee "dict" into an
array of simple values.
- the use of "?" to test elements in the array as "keys".

As Jeff noted, the "?" requires a text value. Luckily I have a
".username" in my real data that should work...so I think I have a
solution without recourse to a custom function! I have appended an
updated test case showing the indexing working. Finally, I do wonder
if a "jsonb_path_query_array_text" function might be worth considering
as an enhancement for when there is not a usable text value available?

Thanks again for all your kind help.

Shaheed

== test case showing how to index "deep" into a JSONB field ==

DROP TABLE payrun;
DROP INDEX idx1;

CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);

INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": "user999(at)foo(dot)com", "state":
{"employment": [["1920-01-01", null, 3]]}, "pay_graph": 0,
"last_run_of_employment": false}, "111": {"id": "user111(at)foo(dot)com",
"state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3,
"last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": "user999(at)foo(dot)com", "state":
{"employment": [["1970-01-01", null, 3]]}, "pay_graph": 6,
"last_run_of_employment": true}, "222": {"id": "user222(at)foo(dot)com",
"state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 5,
"last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": "user998(at)foo(dot)com", "state":
{"employment": [["1980-01-01", null, 3]]}, "pay_graph": 7,
"last_run_of_employment": false}, "333": {"id": "user333(at)foo(dot)com",
"state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3,
"last_run_of_employment": true}}}')
;

CREATE INDEX idx1 ON payrun USING gin
(jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 ||
@.last_run_of_employment == true || @.state.employment[last][2] ==
0).id'));

SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <>
0 || @.last_run_of_employment == true || @.state.employment[last][2]
== 0).id') ? 'user999(at)foo(dot)com';

--
-- Test
--
set enable_seqscan = OFF;

EXPLAIN ANALYSE
SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun"
WHERE jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <>
0 || @.last_run_of_employment == true || @.state.employment[last][2]
== 0).id') ? 'user999(at)foo(dot)com';

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------
Bitmap Heap Scan on payrun (cost=8.00..12.02 rows=1 width=36) (actual
time=0.022..0.023 rows=1 loops=1)
Recheck Cond: (jsonb_path_query_array(snapshot,
'$."employee".*?((@."pay_graph" != 0 || @."last_run_of_employment" ==
true) || @."state"."employment"[last][2] == 0)."id"'::jsonpath,
'{}'::jsonb, false) ? 'user999(at)foo(dot)com
'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on idx1 (cost=0.00..8.00 rows=1 width=0)
(actual time=0.007..0.007 rows=1 loops=1)
Index Cond: (jsonb_path_query_array(snapshot,
'$."employee".*?((@."pay_graph" != 0 || @."last_run_of_employment" ==
true) || @."state"."employment"[last][2] == 0)."id"'::jsonpath,
'{}'::jsonb, false) ? 'user999(at)foo
.com'::text)
Planning Time: 0.062 ms
Execution Time: 0.041 ms
(7 rows)

On Mon, 13 Jun 2022 at 03:23, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> On Sun, Jun 12, 2022 at 3:31 PM Shaheed Haque <shaheedhaque(at)gmail(dot)com> wrote:
>>
>> (Resend, wrong version was sent before)
>>
>> Thanks Erik. Is the point that the index has to be on the JSON field
>> as a whole (i.e. "snapshot") rather than deep inside it (e.g.
>> "snapshot.something.further[down]")?
>>
>> In my case, the snapshot is several MB in size (perhaps 10MB or even
>> 20MB), dominated by the snapshot.employee (cardinality 10k, each sized
>> as a dict 1-2kB as text). My expectation/guess is that an index of
>> "snapshot" will itself be of a size of similar order. However the
>> design as-is works very well except for this one case where to speed
>> it up, in principle, the index need contain no more than one boolean
>> per employee. So that's what I'd like to achieve, if possible.
>
>
> It sounds like what you really want here is to extract just the list of the ids which meet one of your three further criteria, and index that list.
>
> You can do that with jsonpath, but you have to apply it with a function, not one of the boolean-returning operators.
>
> This almost works to do that:
>
> create index on payrun using gin (jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id'));
>
> Then query it like:
>
> select * from payrun where jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id') ? '999';
>
> The problem is that ? only tests for top-level text values, while in your example the value for the 'id' keys are ints, not text. So they are not findable with the ? operator. If I edit your example data to wrap the ids' values in double quotes, turning them into json strings rather than json ints, then this does work for me.
>
> Maybe there is a way to modify the jsonpath so that it converts the ints to text for you. But if there is such a way, I don't know what it is.
>
> If I were doing this for my own system, I would probably create an immutable user-defined-function which took a jsonb and returned an int[] of the filtered id values. Then you would have to query it with @> rather than ?.
>
>>
>> - I've not done things right, in which case I'd love to know my mistake.
>> - It is not supposed to work, in which case it would be good to have
>> that stated, and maybe have PG not allow useless indices to be
>> created.
>> - It is a bug.
>
>
> I guess it is number one mixed with number two. The index you created is useless for your intended purpose, but is not useless for every conceivable purpose. It is not realistic to expect PostgreSQL to reject things just because it is not obvious (to a computer) what you are getting up to.
>
> Cheers,
>
> Jeff

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2022-06-14 01:31:21 Re: BUG #17504: psql --single-transaction -vON_ERROR_STOP=1 still commits after client-side error
Previous Message Tom Lane 2022-06-13 15:16:36 Re: DELETE deletes more than one rows when LIMIT is used in the USING clause