Re: Collecting statistics about contents of JSONB columns

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>, Mahendra Thalor <mahendra(dot)thalor(at)enterprisedb(dot)com>, Oleg Bartunov <obartunov(at)postgrespro(dot)ru>
Subject: Re: Collecting statistics about contents of JSONB columns
Date: 2022-01-01 15:33:10
Message-ID: CALNJ-vQ7p=ZaZ=prwU=tV3FqXhF6CrW+u2Wve3iJVn2wGMBS8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 31, 2021 at 2:07 PM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> Hi,
>
> One of the complaints I sometimes hear from users and customers using
> Postgres to store JSON documents (as JSONB type, of course) is that the
> selectivity estimates are often pretty poor.
>
> Currently we only really have MCV and histograms with whole documents,
> and we can deduce some stats from that. But that is somewhat bogus
> because there's only ~100 documents in either MCV/histogram (with the
> default statistics target). And moreover we discard all "oversized"
> values (over 1kB) before even calculating those stats, which makes it
> even less representative.
>
> A couple weeks ago I started playing with this, and I experimented with
> improving extended statistics in this direction. After a while I noticed
> a forgotten development branch from 2016 which tried to do this by
> adding a custom typanalyze function, which seemed like a more natural
> idea (because it's really a statistics for a single column).
>
> But then I went to pgconf NYC in early December, and I spoke to Oleg
> about various JSON-related things, and he mentioned they've been working
> on this topic some time ago too, but did not have time to pursue it. So
> he pointed me to a branch [1] developed by Nikita Glukhov.
>
> I like Nikita's branch because it solved a couple architectural issues
> that I've been aware of, but only solved them in a rather hackish way.
>
> I had a discussion with Nikita about his approach what can we do to move
> it forward. He's focusing on other JSON stuff, but he's OK with me
> taking over and moving it forward. So here we go ...
>
> Nikita rebased his branch recently, I've kept improving it in various
> (mostly a lot of comments and docs, and some minor fixes and tweaks).
> I've pushed my version with a couple extra commits in [2], but you can
> ignore that except if you want to see what I added/changed.
>
> Attached is a couple patches adding adding the main part of the feature.
> There's a couple more commits in the github repositories, adding more
> advanced features - I'll briefly explain those later, but I'm not
> including them here because those are optional features and it'd be
> distracting to include them here.
>
> There are 6 patches in the series, but the magic mostly happens in parts
> 0001 and 0006. The other parts are mostly just adding infrastructure,
> which may be a sizeable amount of code, but the changes are fairly
> simple and obvious. So let's focus on 0001 and 0006.
>
> To add JSON statistics we need to do two basic things - we need to build
> the statistics and then we need to allow using them while estimating
> conditions.
>
>
> 1) building stats
>
> Let's talk about building the stats first. The patch does one of the
> things I experimented with - 0006 adds a jsonb_typanalyze function, and
> it associates it with the data type. The function extracts paths and
> values from the JSONB document, builds the statistics, and then stores
> the result in pg_statistic as a new stakind.
>
> I've been planning to store the stats in pg_statistic too, but I've been
> considering to use a custom data type. The patch does something far more
> elegant - it simply uses stavalues to store an array of JSONB documents,
> each describing stats for one path extracted from the sampled documents.
>
> One (very simple) element of the array might look like this:
>
> {"freq": 1,
> "json": {
> "mcv": {
> "values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
> "numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
> "width": 19,
> "distinct": 10,
> "nullfrac": 0,
> "correlation": 0.10449},
> "path": "$.\"a\"",
> "freq_null": 0, "freq_array": 0, "freq_object": 0,
> "freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}
>
> In this case there's only a MCV list (represented by two arrays, just
> like in pg_statistic), but there might be another part with a histogram.
> There's also the other columns we'd expect to see in pg_statistic.
>
> In principle, we need pg_statistic for each path we extract from the
> JSON documents and decide it's interesting enough for estimation. There
> are probably other ways to serialize/represent this, but I find using
> JSONB for this pretty convenient because we need to deal with a mix of
> data types (for the same path), and other JSON specific stuff. Storing
> that in Postgres arrays would be problematic.
>
> I'm sure there's plenty open questions - for example I think we'll need
> some logic to decide which paths to keep, otherwise the statistics can
> get quite big, if we're dealing with large / variable documents. We're
> already doing something similar for MCV lists.
>
> One of Nikita's patches not included in this thread allow "selective"
> statistics, where you can define in advance a "filter" restricting which
> parts are considered interesting by ANALYZE. That's interesting, but I
> think we need some simple MCV-like heuristics first anyway.
>
> Another open question is how deep the stats should be. Imagine documents
> like this:
>
> {"a" : {"b" : {"c" : {"d" : ...}}}}
>
> The current patch build stats for all possible paths:
>
> "a"
> "a.b"
> "a.b.c"
> "a.b.c.d"
>
> and of course many of the paths will often have JSONB documents as
> values, not simple scalar values. I wonder if we should limit the depth
> somehow, and maybe build stats only for scalar values.
>
>
> 2) applying the statistics
>
> One of the problems is how to actually use the statistics. For @>
> operator it's simple enough, because it's (jsonb @> jsonb) so we have
> direct access to the stats. But often the conditions look like this:
>
> jsonb_column ->> 'key' = 'value'
>
> so the condition is actually on an expression, not on the JSONB column
> directly. My solutions were pretty ugly hacks, but Nikita had a neat
> idea - we can define a custom procedure for each operator, which is
> responsible for "calculating" the statistics for the expression.
>
> So in this case "->>" will have such "oprstat" procedure, which fetches
> stats for the JSONB column, extracts stats for the "key" path. And then
> we can use that for estimation of the (text = text) condition.
>
> This is what 0001 does, pretty much. We simply look for expression stats
> provided by an index, extended statistics, and then - if oprstat is
> defined for the operator - we try to derive the stats.
>
> This opens other interesting opportunities for the future - one of the
> parts adds oprstat for basic arithmetic operators, which allows deducing
> statistics for expressions like (a+10) from statistics on column (a).
>
> Which seems like a neat feature on it's own, but it also interacts with
> the extended statistics in somewhat non-obvious ways (especially when
> estimating GROUP BY cardinalities).
>
> Of course, there's a limit of what we can reasonably estimate - for
> example, there may be statistical dependencies between paths, and this
> patch does not even attempt to deal with that. In a way, this is similar
> to correlation between columns, except that here we have a dynamic set
> of columns, which makes it much much harder. We'd need something like
> extended stats on steroids, pretty much.
>
>
> I'm sure I've forgotten various important bits - many of them are
> mentioned or explained in comments, but I'm sure others are not. And I'd
> bet there are things I forgot about entirely or got wrong. So feel free
> to ask.
>
>
> In any case, I think this seems like a good first step to improve our
> estimates for JSOB columns.
>
> regards
>
>
> [1] https://github.com/postgrespro/postgres/tree/jsonb_stats
>
> [2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi,
For patch 1:

+ List *statisticsName = NIL; /* optional stats estimat.
procedure */

I think if the variable is named estimatorName (or something similar), it
would be easier for people to grasp its purpose.

+ /* XXX perhaps full "statistics" wording would be better */
+ else if (strcmp(defel->defname, "stats") == 0)

I would recommend (stats sounds too general):

+ else if (strcmp(defel->defname, "statsestimator") == 0)

+ statisticsOid = ValidateStatisticsEstimator(statisticsName);

statisticsOid -> statsEstimatorOid

For get_oprstat():

+ }
+ else
+ return (RegProcedure) InvalidOid;

keyword else is not needed (considering the return statement in if block).

For patch 06:

+ /* FIXME Could be before the memset, I guess? Checking
vardata->statsTuple. */
+ if (!data->statsTuple)
+ return false;

I would agree the check can be lifted above the memset call.

+ * XXX This does not really extract any stats, it merely allocates the
struct?
+ */
+static JsonPathStats
+jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)

As comments says, I think allocJsonPathStats() would be better name for the
func.

+ * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
+ * jsonPathStatsGetLengthStats does?

I think `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check
should be added for jsonPathStatsGetArrayLengthStats().

To be continued ...

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-01-01 16:35:02 Re: Probable memory leak with ECPG and AIX
Previous Message Joel Jacobson 2022-01-01 12:52:43 Updatable Views and INSERT INTO ... ON CONFLICT