Quick Links

Re: Collecting statistics about contents of JSONB columns

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Mahendra Singh Thalor <mahi6run(at)gmail(dot)com>, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject:	Re: Collecting statistics about contents of JSONB columns
Date:	2022-05-17 16:06:57
Message-ID:	760222d8-22ce-dc2a-22b2-6c15bd7431f4@enterprisedb.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 5/17/22 13:44, Mahendra Singh Thalor wrote:
> ...
>
> Hi Nikita,
> I and Tomas discussed the design for disabling all-paths
> collection(collect stats for only some paths). Below are some
> thoughts/doubts/questions.
>
> *Point 1)* Please can you elaborate more that how are you going to
> implement this(collect stats for only some paths).

I think Nikita mentioned he plans to only build stats only for most
common paths, which seems generally straightforward:

1) first pass over the documents, collect distinct paths and track how
many times we saw each one

2) in the second pass extract stats only for the most common paths (e.g.
the top 100 most common ones, or whatever the statistics target says)

I guess we might store at least the frequencing for uncommon paths,
which seems somewhat useful for selectivity estimation.

I wonder if we might further optimize this for less common paths. AFAICS
one of the reasons why we process the paths one by one (in the second
pass) is to limit memory consumption. By processing a single path, we
only need to accumulate values for that path.

But if we know the path is uncommon, we know there'll be few values. For
example the path may be only in 100 documents, not the whole sample. So
maybe we might process multiple paths at once (which would mean we don't
need to detoast the JSON documents that often, etc.).

OTOH that may be pointless, because if the paths are uncommon, chances
are the subsets of documents will be different, in which case it's
probably cheaper to just process the paths one by one.

> *Point 2) *As JSON stats are taking time so should we add an on/off
> switch to collect JSON stats?

IMHO we should think about doing that. I think it's not really possible
to eliminate (significant) regressions for all corner cases, and in many
cases people don't even need this statistics (e.g. when just storing and
retrieving JSON docs, without querying contents of the docs).

I don't know how exactly to enable/disable this - it very much depends
on how we store the stats. If we store that in pg_statistic, then ALTER
TABLE ... ALTER COLUMN seems like the right way to enable/disable these
path stats. We might also have a new "json" stats and do this through
CREATE STATISTICS. Or something else, not sure.

> *Point 3)* We thought of one more design: we can give an explicit path
> to collect stats for a particular path only or we can pass a subset of
> the JSON values but this may require a lot of code changes like syntax
> and all so we are thinking that it will be good if we can collect stats
> only for some common paths(by limit or any other way)
>

I'm not sure I understand what this is saying, particularly the part
about subset of JSON values. Can you elaborate?

I can imagine specifying a list of interesting paths, and we'd only
collect stats for the matching subset of the JSON documents. So if you
have huge JSON documents with complex schema, but you only query a very
limited subset of paths, we could restrict ANALYZE to this subset.

In fact, that's what the 'selective analyze' commit [1] in Nikita's
original patch set does in principle. We'd probably need to improve this
in some ways (e.g. to allow defining the column filter not just in
ANALYZE itself). I left it out of this patch to keep the patch as simple
as possible.

But why/how exactly would we limit the "JSON values"? Can you give some
example demonstrating that in practice?

regards

[1]
https://github.com/postgrespro/postgres/commit/7ab7397450df153e5a8563c978728cb731a0df33

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Collecting statistics about contents of JSONB columns at 2022-05-17 11:44:23 from Mahendra Singh Thalor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2022-05-17 16:24:24	Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
Previous Message	Tom Lane	2022-05-17 15:09:23	Re: create_help.pl treats <literal> as replaceable