Add support for (Var op Var) clause in extended MCV statistics

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-12 10:42:07
Message-ID: 46d09d66-1692-4a12-a106-06981a47cb28@tantorlabs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I'd like to submit a patch that improves the estimated rows for queries
containing (Var op Var) clauses by applying extended MCV statistics.

*New functions:*

* mcv_clauselist_selectivity_var_op_var() - calculates the selectivity
for (Var op Var) clauses.
* is_opclause_var_op_var() - Checks whether a clause is of the (Var op
Var) form.

*Implementation Details:*

* A new 'if' statement was added to the 'clause_selectivity_ext()'
function to handle (Var op Var) clauses. This allows the process to
locate matching MCV extended statistics and calculate selectivity
using the newly introduced function.
* Additionally, I added 'if' statement
in statext_is_compatible_clause_internal() function to determine
which columns are included in the clause, find matching extended
statistics, and then calculate selectivity through the new function.
I did the same in mcv_get_match_bitmap() to check what values are
true for (Var op Var).
* To support this, I created a new enum type to differentiate between
OR/AND and (Var op Var) clauses.

*Examples:*

create table t (a int, b int);
insert into t select mod(i,10), mod(i,10)+1 from
generate_series(1,100000) s(i);
analyze t;
explain select * from t where a < b;
`
    Estimated:   33333
    Actual:       100000

explain select * from t where a > b;
`
    Estimated:   33333
    Actual:       100000

create statistics s (mcv) on a,b from t;
analyze t;
explain select * from t where a < b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000
    Actual:                             100000

explain select * from t where a > b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000
    Actual:                             100000

If you want to see more examples, see regress tests in the patch.

*Previous thread:*

This feature was originally developed two years ago in [1], and at that
time, the approach was almost the same. My implementation uses dedicated
functions and 'if' statements directly for better readability and
maintainability. Additionally, there was a bug in the previous approach
that has been resolved with my patch. Here’s an example of the bug and
its fix:

CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated:   18
    Actual:           9

CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated previous patch:  18
    Estimated current patch:      9
    Actual:                                  9

[1]:
https://www.postgresql.org/message-id/flat/9e0a12e0-c05f-b193-ed3d-fe88bc1e5fe1%40enterprisedb.com

I look forward to any feedback or suggestions from the community.

Best regars,
Ilia Evdokimov
Tantor Labs LLC.

Attachment Content-Type Size
v1-Add-support-for-Var-op-Var-clause-in-extended-MCV-st.patch text/x-patch 48.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2024-08-12 10:46:22 Re: ECPG cleanup and fix for clang compile-time problem
Previous Message Bertrand Drouvot 2024-08-12 10:35:44 Re: Historic snapshot doesn't track txns committed in BUILDING_SNAPSHOT state