Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2022-07-21 11:42:19
Message-ID: ecc0b08a-518d-7ad6-17ed-a5e962fc4f5f@enterprisedb.com
Lists: pgsql-hackers

On 7/9/22 14:04, Dean Rasheed wrote:
> On Mon, 13 Dec 2021 at 02:21, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> Hi,
>>
>> I finally got around to this patch again, focusing mostly on the first
>> part that simply returns either 1.0 or 0.0 for Var op Var conditions
>> (i.e. the part not really using extended statistics).
>>
>
> Just starting to look at this again, starting with 0001 ...
>
> This needs to account for nullfrac, since x = x is only true if x is not null.
>

Right, I forgot to account for nullfrac.
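
For illustration, a trivial made-up example (not from the patch or the
regression tests) of why (x = x) needs to be scaled by (1 - nullfrac)
rather than estimated as 1.0:

  CREATE TABLE t (x int);
  INSERT INTO t SELECT CASE WHEN i % 4 = 0 THEN NULL ELSE i END
    FROM generate_series(1, 10000) s(i);
  ANALYZE t;

  -- nullfrac is ~0.25, so (x = x) matches only ~7500 of the 10000 rows,
  -- because NULL = NULL is not true and the WHERE clause filters it out
  EXPLAIN ANALYZE SELECT * FROM t WHERE x = x;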

> I don't like how matching_restriction_variables() is adding a
> non-trivial amount of overhead to each of these selectivity functions,
> by calling examine_variable() for each side, duplicating what
> get_restriction_variable() does later on.
>

But matching_restriction_variables() does not use examine_variable()
anymore. It did originally, but 0002 removed all of that - now it only
calls pull_varnos() and nothing else. I admit leaving those bits
unsquashed was a bit confusing, but the first part was really
0001+0002+0003 combined.

> I think it would probably be better to make the changes in
> var_eq_non_const(), where all of this information is available, and
> get rid of matching_restriction_variables() (it doesn't look like the
> new scalarineqsel_wrapper() code really needs
> matching_restriction_variables() - it can just use what
> get_restriction_variable() already returns).
>

I'm not sure how that could help. var_eq_non_const() only really applies
to eqsel_internal(), so we'd still need to deal with various other
places for the inequality operators.

Attached is a rebased and somewhat cleaned up version of the patch
series, addressing the review comments so far and squashing the bits I
previously kept separate to showcase the changes.

I've also added a bunch of regression tests - queries with (Var op Var)
clauses of varying complexity, to demonstrate the effect of each patch.
I added them as 0001, so it's clear how the individual patches affect
the results.
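
Just to illustrate the kind of queries the tests exercise (made-up
table/column names, not a verbatim copy from 0001):

  CREATE TABLE t (a int, b int, c int);
  INSERT INTO t SELECT mod(i, 100), mod(i, 100), mod(i, 500)
    FROM generate_series(1, 10000) s(i);

  -- extended statistics (MCV) on the columns used in the clauses
  CREATE STATISTICS s (mcv) ON a, b, c FROM t;
  ANALYZE t;

  -- (Var op Var) clauses comparing columns of the same relation
  EXPLAIN SELECT * FROM t WHERE a = b AND b < c;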

I've also written a generator (the attached generator.py) that generates
both data and queries with (Var op Var) clauses, and then runs them on
multiple connections to compare the estimates. Running it requires some
effort (setting up the clusters, ...), but it shouldn't be too hard to
get working.

The results are pretty massive (thousands of queries), but a simple
summary showing percentiles (0.25, 0.5, 0.75, 0.9) of the estimation
error, calculated as

abs(estimated - actual) / (actual + 1)

(the +1 in the denominator deals with queries matching 0 actual rows),
looks like this:

clauses | stats | master | patched
---------+-------+---------------------------+------------------------
1 | no | {0.39, 0.67, 4.17, 10000} | {0.29, 0.67, 1, 10000}
2 | no | {0.38, 0.79, 50, 9950} | {0.26, 0.73, 1, 3333}
3 | no | {0.3, 0.84, 50, 3317} | {0.22, 0.78, 1, 1111}
4 | no | {0.24, 0.84, 25, 1852} | {0.14, 0.78, 1, 370}
5 | no | {0.2, 0.85, 17, 1100} | {0.11, 0.78, 1, 50}
1 | yes | {0.39, 0.67, 4.17, 10000} | {0, 0.14, 1, 1}
2 | yes | {0.38, 0.79, 50, 9950} | {0, 0.15, 1, 1}
3 | yes | {0.3, 0.84, 50, 3317} | {0, 0.15, 1, 1}
4 | yes | {0.24, 0.84, 25, 1852} | {0, 0.17, 1, 1}
5 | yes | {0.2, 0.85, 17, 1100} | {0, 0.14, 1, 1}
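
(FWIW, with the per-query results loaded into a table - say
results(clauses, stats, estimated, actual), hypothetical names - the
percentile summary above can be computed with a query like this:)

  SELECT clauses, stats,
         percentile_cont(ARRAY[0.25, 0.5, 0.75, 0.9])
           WITHIN GROUP (ORDER BY abs(estimated - actual) / (actual + 1.0))
    FROM results
   GROUP BY clauses, stats
   ORDER BY stats, clauses;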

This seems pretty good, IMO. Without extended stats on the columns,
there's only so much we can do, but even then the errors got much
smaller. With the stats it's clearly way better.

Of course, take this with a grain of salt - those are randomly generated
synthetic queries, with all queries considered equally "likely". Clearly,
some queries are more likely to appear in real applications, and those
are the more important ones to estimate well. However, the point of this
was to see if there are classes of queries that would get much worse, and
I haven't found anything like that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
generator.py text/x-python 4.1 KB
0001-Add-regression-tests-for-Var-op-Var-20220721.patch text/x-patch 105.1 KB
0002-Improve-Var-op-Var-estimates-with-the-same--20220721.patch text/x-patch 56.0 KB
0003-Modify-selectivity-estimation-for-Var-op-Va-20220721.patch text/x-patch 40.3 KB
0004-Don-t-treat-Var-op-Var-as-simple-clauses-in-20220721.patch text/x-patch 13.3 KB
