Re: Abnormal JSON query performance

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: 007reader <007reader(at)gmail(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Abnormal JSON query performance
Date: 2018-05-14 05:04:16
Message-ID: CAFj8pRCbXoxEmfq_zCuNhQHqgPKYhHXWfzsahWi+OsmNgbw8SQ@mail.gmail.com
Lists: pgsql-bugs

2018-05-14 5:48 GMT+02:00 <007reader(at)gmail(dot)com>:

> I think we are talking about two different use cases. I am not interested
> in graph operations on a document.
>
> My interest is in fast access to data. On a relational table, the query
> time is about the same whether I have one or ten fields in a select
> statement. I’d love to see the same behavior when getting multiple keys
> from a JSON document. That doesn’t seem to require graph manipulations. In
> a simplistic naive approach, it may be just walking a JSON document and
> getting all fields from the select statement in a single pass. It’s not
> quite full independence from the number of fields, but should be better
> than doubling execution time for each additional key in the current
> implementation.
>

Your expectation is not valid for longer fields. Here is one experiment:

create table test(a varchar, b varchar);

CREATE OR REPLACE FUNCTION public.random_str(integer)
RETURNS character varying
LANGUAGE sql
AS $function$
  select string_agg(v, '')
    from (select substring('abcdefghijklmnopqrstuvwxyz'
                           from (random()*26)::int + 1 for 1) v
            from generate_series(1, $1)) s;
$function$;

insert into test
  select random_str(4000), random_str(4000)
    from generate_series(1, 100000);

postgres=# explain analyze select length(a) from test;
                                                  QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Seq Scan on test  (cost=0.00..2084.00 rows=100000 width=4) (actual time=0.315..2952.151 rows=100000 loops=1)
 Planning Time: 0.109 ms
 Execution Time: 2960.654 ms
(3 rows)

postgres=# explain analyze select length(a), length(b) from test;
                                                  QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Seq Scan on test  (cost=0.00..2334.00 rows=100000 width=8) (actual time=0.367..5420.946 rows=100000 loops=1)
 Planning Time: 0.103 ms
 Execution Time: 5431.446 ms
(3 rows)

postgres=# explain analyze select length(a), length(a) from test;
                                                  QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Seq Scan on test  (cost=0.00..2334.00 rows=100000 width=8) (actual time=0.367..5404.059 rows=100000 loops=1)
 Planning Time: 0.100 ms
 Execution Time: 5414.443 ms
(3 rows)

Notice that the third query reads the same column twice and still pays the cost twice - each reference fetches and detoasts the value again. You can break the document into smaller pieces, but that approach has its limits too.

The total speed depends on a few factors:

1. How much data is read from disk - t1 (this can be unimportant when the data
is already in the PostgreSQL cache - shared buffers).
2. Decompression - long XML, JSON and Jsonb values are stored on disk and in
the PostgreSQL cache in compressed form, and every access to a long document
decompresses it again - t2. Jsonb documents can be longer than JSON.
3. Parsing the document - paid on every access, significant for XML and JSON - t3.
4. Loading the binary document - paid on every access, significant for Jsonb - t4.

The total time is t1 + t2 + t3 + t4.
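
If you want to see how much room there is for the t2 (decompression) part, a
rough check is to compare the stored size of a value with its uncompressed
text length. This is only a diagnostic sketch - the table name jsontab and
column d below are placeholders, not anything from the report:

select pg_column_size(d) as stored_bytes,    -- bytes as kept in the row/TOAST (possibly compressed)
       octet_length(d::text) as text_bytes   -- bytes after decompressing and printing the value
  from jsontab
 limit 10;

When stored_bytes is much smaller than text_bytes, every access pays a
noticeable decompression cost.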

I am not sure whether t2 is optimized now for repeated access to the same data -
probably not - you can find some messages in the archives about a cache for
detoasted data. Postgres has some optimizations, but they are used only from
PL/pgSQL and only for arrays and records - certainly not for JSON or Jsonb.

You can probably do some optimization with your own function:

CREATE OR REPLACE FUNCTION fx(d jsonb, OUT a text, OUT b text)
AS $$
BEGIN
  a := d ->> 'a';
  b := d ->> 'b';
END;
$$ LANGUAGE plpgsql;

SELECT (fx).a, (fx).b FROM (SELECT fx(d) FROM jsontab) s;

can be faster than

SELECT d->>'a', d->>'b' FROM jsontab

because it eliminates the repeated t2 time.

Maybe someone will write a special row cache that can eliminate the repeated t2,
t3 and t4 times - that would be interesting for repeated access to different
fields of a JSON document. Currently there is nothing similar, although there
is a patch for a JSON_TABLE function.

It is analogous to the XMLTABLE function. With this function you can take
multiple fields with just one pass over the document.
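
Today, one way to get several keys in one pass over the document is the
jsonb_to_record() function (json_to_record() for json). A minimal sketch - the
table jsontab, column d and keys a, b are again only placeholders:

select r.a, r.b
  from jsontab,
       jsonb_to_record(jsontab.d) as r(a text, b text);

The function is called once per row, so the document is detoasted and walked
once for both keys - similar in spirit to what JSON_TABLE should provide with
a richer syntax.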

Regards

Pavel Stehule

>
> On May 13, 2018, at 12:43 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> wrote:
>
>
>
> 2018-05-13 19:16 GMT+02:00 <007reader(at)gmail(dot)com>:
>
>> You are right. I tried similar queries for the array data type, retrieving
>> multiple selected array elements, and the query time increases as the number
>> of requested array elements increases.
>> This is a very disappointing fact. Documentation and multiple articles on
>> the internet promote the ability to extract specific JSON keys like “select
>> field1->field2->>key1, field2->>key2, ...”, but this turns out to be very
>> inefficient. The design seems to favor retrieval of the entire json
>> document and processing on the application side.
>>
>
> PostgreSQL is not a JSON or XML database that breaks a document into a graph
> and works on that graph. Unfortunately, I don't know of any database that
> does. Searching a very complex graph is available only in in-memory databases.
>
> PostgreSQL 11 allows an index-only scan over a GIN index, if I remember
> correctly. Maybe that can help you.
>
>
>>
>> Is it possible to add a note to the documentation articulating this
>> important detail? I’ll ask someone to write a blog post about this deficiency ...
>>
>
> "JSON data is subject to the same concurrency-control considerations as
> any other data type when stored in a table. Although storing large
> documents is practicable, keep in mind that any update acquires a row-level
> lock on the whole row. Consider limiting JSON documents to a manageable
> size in order to decrease lock contention among updating transactions.
> Ideally, JSON documents should each represent an atomic datum that business
> rules dictate cannot reasonably be further subdivided into smaller datums
> that could be modified independently."
>
> This is part of the documentation - and it should be mentioned that a JSON or
> JSONB document is read/written as one value every time. So if you use long
> documents and usually read only a few fields, then a seq scan will be very
> inefficient, and an index scan for many rows will be expensive too.
>
> Regards
>
> Pavel
>
>
>
>
>>
>> -Michael
>>
>> On May 12, 2018, at 12:36 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
>> wrote:
>>
>>
>>
>> 2018-05-12 9:31 GMT+02:00 <007reader(at)gmail(dot)com>:
>>
>>> Sorry again, autocorrect kills me. I use JSONB, not JSON. Serving the
>>> entire document for jsonb doesn’t make much sense. If jsonb is pre-parsed,
>>> selecting and assembling a few keys must be faster than assembling all of
>>> them.
>>>
>>
>> jsonb is preparsed, but that is all. PostgreSQL can't store one value
>> in separate blocks. It can't do that for any type - json(b) is not an
>> exception.
>>
>> Regards
>>
>> Pavel
>>
>>
>>>
>>>
>>> On May 11, 2018, at 10:03 PM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
>>> wrote:
>>>
>>> Hi
>>>
>>> 2018-05-12 6:32 GMT+02:00 <007reader(at)gmail(dot)com>:
>>>
>>>> I observed a performance anomaly with JSON that I believe is incorrect in
>>>> version 10, although I didn’t test earlier ones.
>>>> Here is a test table:
>>>> create table test(
>>>> id integer primary key,
>>>> data json
>>>> );
>>>>
>>>> A simple query works fine on a table of 92,000 records with ~60KB
>>>> average record size.
>>>>
>>>> select data->>'key1' from test limit 1000;
>>>>
>>>> The query above takes about 155 ms. If I add an additional field to the
>>>> list of fields, execution time increases linearly - retrieving 2 fields
>>>> takes ~310 ms, 3 fields ~465 ms, and so on. The explain statement for 5
>>>> fields looks normal
>>>>
>>>> Limit  (cost=0.00..34.10 rows=1000 width=160) (actual time=1.783..813.643 rows=1000 loops=1)
>>>>   ->  Seq Scan on loans  (cost=0.00..3165.85 rows=92834 width=160) (actual time=1.753..805.629 rows=1000 loops=1)
>>>> Planning time: 8.546 ms
>>>> Execution time: 817.874 ms
>>>>
>>>> Looks like every record gets parsed separately for each key... Unless
>>>> I’m doing something incorrectly, this makes retrieval of individual keys
>>>> completely useless for practical cases. It’d leave only the unattractive
>>>> alternative of getting the entire JSON document. In my case, instead of
>>>> retrieving a few hundred bytes, it would force getting 60KB for each query...
>>>>
>>>
>>> It is not a bug. This behavior is expected.
>>>
>>> Json values are stored as plain text in PostgreSQL. Every time it is
>>> necessary, the value is parsed again. If you need fast access to one
>>> specific field, then you can use a functional index:
>>>
>>> CREATE INDEX ON test ((data->>'key1'));
>>>
>>> You can try the JSONB type - it is like Json, but it is preparsed, so access
>>> to some fields should be faster. Again, a functional index can be used,
>>> or special JSONB indexes for faster searching.
>>>
>>> Don’t forget - PostgreSQL is a relational database - sequential
>>> access to data will be fastest when the data is stored in relational
>>> form - every attribute stored in its own column. Non-atomic data types
>>> are more dynamic, but access is significantly slower - it is not important
>>> for hundreds of values, but it becomes clear when you read more than
>>> thousands of values ...
>>>
>>> p.s. index is your friend
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>>
>>>
>>>
>>>>
>>>> -Bob Jones
>>>>
>>>
>>>
>>
>
