Re: PoC/WIP: Extended statistics on expressions

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC/WIP: Extended statistics on expressions
Date: 2021-03-07 21:10:09
Message-ID: 1913c1fe-13da-7250-adf4-befeda397536@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Here is an updated version of the patch series, addressing most of the
issues raised so far. It also adapts the new version of the bootstrap
patches shared by Justin on March 3.

I've merged the two patches reworking tracking of expressions, that I
kept separate to make review easier. But as we agree it's a good
approach, I've merged them into the main patch. FWIW I agree trying to
undo the bitmapsets entirely would be a step too far - the patch is
already quite large, so I'll leave it for the future.

As for the changes, I did a bunch of cleanup in the code supporting
functional dependencies and mcv. Most of this was cosmetic in the "not
changing" behavior - comments, undoing some unnecessary changes to make
the code more like before, regression tests, etc.

There are a couple notable changes, worth mentioning explicitly.

1) functional dependencies

I looked at merging dependency_is_compatible_expression and
dependency_is_compatible_clause, and I've made the code of those
functions much more similar. I didn't go as far as actually merging
those functions. Maybe we actually should do that to correctly handle
"nested cases" with Vars in expressions, but I'm not sure about that.

I added support for the SAOP and OR clauses. Not sure about the OR case,
but SAOP was missing mostly because that feature was added after this
patch was created.

This also required rethinking the handling of RestrictInfo - it was
required for expressions but not for Vars, for some reason. But that
would not work for OR clauses. So now it's mostly what we do for Vars.

There's a couple minor FIXMEs remaining, I'll look into those next.

2) ndistinct

So far the code in selfuncs.c using ndistinct stats to estimate GROUP BY
was quite WIP / experimental, and when I started looking at it, adding
regression tests etc., I discovered a bunch of bugs. Some of that was
due to the reworks in tracking expressions, but not all. I fixed all of
that and cleaned the code quite a bit. I'm not going to claim it's bug
free, but I think it's in a much better shape now.

There's one thing that's bugging me, in how we handle "partial" matches.
For each expression we track both the original expression and the Vars
we extract from it. If we can't find a statistics matching the whole
expression, we try to match those individual Vars, and we remove the
matching ones from the list. And in the end we multiply the estimates
for the remaining Vars.

This works fine with one matching ndistinct statistics. Consider for example

GROUP BY (a+b), (c+d)

with statistics on [(a+b),c] - that is, expression and one column. We
parse the expressions into two GroupExprInfo

{expr: (a+b), vars: [a, b]}
{expr: (c+d), vars: [c, d]}

and the statistics matches the first item exactly (the expression). The
second expression is not in the statistics, but we match "c". So we end
up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:

{expr: (c+d), vars: [d]}

Without any other statistics we estimate that as ndistinct for "d", so
we end up with

ndistinct((a+b), c) * ndistinct(d)

which mostly makes sense. It assumes ndistinct(c+d) is product of the
ndistinct estimates, but that's kinda what we've been always doing.

But now consider we have another statistics on just (c+d). In the second
loop we end up matching this expression exactly, so we end up with

ndistinct((a+b), c) * ndistinct((c+d))

i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
what we should do after the first loop is just discarding the whole
expression and "expand" into per-variable GroupExprInfo, so in the
second step we would not match the (c+d) statistics.

Of course, maybe there's a better way to pick the statistics, but I
think our conclusion so far was that people should just create
statistics covering all the columns in the query, to not have to match
multiple statistics like this.

3) regression tests

The patch adds a bunch of regression tests - I admit I've been adding
the tests a bit arbitrarily, mostly copy-paste of existing tests and
tweaking them to use expressions. This helped with identifying bugs, but
the runtime of the stats_ext test suite grew quite a lot - maybe 2-3x,
and it's not one of the slowest cases on my system (~3 seconds). I think
we need to either reduce the number of new tests, or maybe move some of
the tests into a separate parallel test suite.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-bootstrap-convert-Typ-to-a-List-20210307.patch text/x-patch 5.2 KB
0002-Allow-composite-types-in-bootstrap-20210307.patch text/x-patch 1.4 KB
0003-Use-correct-statistics-kind-in-a-couple-pla-20210307.patch text/x-patch 3.6 KB
0004-Extended-statistics-on-expressions-20210307.patch text/x-patch 366.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2021-03-07 22:41:42 Re: pg_upgrade failing for 200+ million Large Objects
Previous Message David Rowley 2021-03-07 21:08:14 Re: [patch] bit XOR aggregate functions