Re: POC, WIP: OR-clause support for indexes

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: jian he <jian(dot)universality(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nikolay Shaplov <dhyan(at)nataraj(dot)su>, pgsql-hackers(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, teodor(at)sigaev(dot)ru, Peter Eisentraut <peter(at)eisentraut(dot)org>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2024-10-28 16:55:35
Message-ID: 0026f562-d403-4f97-b9dc-3fe3a279e8c9@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Jian! Thank you for your work on this topic!

On 28.10.2024 10:19, jian he wrote:
> * NOTE: returns NULL if clause is an OR or AND clause; it is the
> * responsibility of higher-level routines to cope with those.
> */
> static IndexClause *
> match_clause_to_indexcol(PlannerInfo *root,
> RestrictInfo *rinfo,
> int indexcol,
> IndexOptInfo *index)
>
> the above comments need a slight change.
>
>
> EXPLAIN (COSTS OFF, settings) SELECT * FROM tenk2 WHERE (thousand = 1
> OR thousand = 3);
> QUERY PLAN
> -----------------------------------------------------------
> Bitmap Heap Scan on tenk2
> Recheck Cond: ((thousand = 1) OR (thousand = 3))
> -> Bitmap Index Scan on tenk2_thous_tenthous
> Index Cond: (thousand = ANY ('{1,3}'::integer[]))
>
> EXPLAIN (COSTS OFF, settings) SELECT * FROM tenk2 WHERE (thousand in (1,3));
> QUERY PLAN
> -----------------------------------------------------------
> Bitmap Heap Scan on tenk2
> Recheck Cond: (thousand = ANY ('{1,3}'::integer[]))
> -> Bitmap Index Scan on tenk2_thous_tenthous
> Index Cond: (thousand = ANY ('{1,3}'::integer[]))
>
> tenk2 index:
> Indexes:
> "tenk2_thous_tenthous" btree (thousand, tenthous)
>
> Looking at the above cases, I found out the "Recheck Cond" is
> different from "Index Cond".
> I wonder why there is a difference, or if they should be the same.
> then i come to:
> match_orclause_to_indexcol
>
> /*
> * Finally, build an IndexClause based on the SAOP node. Use
> * make_simple_restrictinfo() to get RestrictInfo with clean selectivity
> * estimations because it may differ from the estimation made for an OR
> * clause. Although it is not a lossy expression, keep the old version of
> * rinfo in iclause->rinfo to detect duplicates and recheck the original
> * clause.
> */
> iclause = makeNode(IndexClause);
> iclause->rinfo = rinfo;
> iclause->indexquals = list_make1(make_simple_restrictinfo(root,
> &saopexpr->xpr));
> iclause->lossy = false;
> iclause->indexcol = indexcol;
> iclause->indexcols = NIL;
>
> looking at create_bitmap_scan_plan.
> I think "iclause->rinfo" itself won't be able to detect duplicates.
> since the upper code would mostly use "iclause->indexquals" for comparison?
>
>
> typedef struct IndexClause comments says:
> "
> * indexquals is a list of RestrictInfos for the directly-usable index
> * conditions associated with this IndexClause. In the simplest case
> * it's a one-element list whose member is iclause->rinfo. Otherwise,
> * it contains one or more directly-usable indexqual conditions extracted
> * from the given clause. The 'lossy' flag indicates whether the
> * indexquals are semantically equivalent to the original clause, or
> * represent a weaker condition.
> "
> should lossy be iclause->lossy be true at the end of match_orclause_to_indexcol?
> since it meets the comment condition: "semantically equivalent to the
> original clause"
> or is the above comment slightly wrong?
>
> in match_orclause_to_indexcol
> i changed from
> iclause->rinfo = rinfo;
> to
> iclause->rinfo = make_simple_restrictinfo(root,
> &saopexpr->xpr);
>
> as expected. now the "Recheck Cond" is same as "Index Cond"
> Recheck Cond: (thousand = ANY ('{1,3}'::integer[]))
> -> Bitmap Index Scan on tenk2_thous_tenthous
> Index Cond: (thousand = ANY ('{1,3}'::integer[]))
>
> I am not sure of the implication of this change.
>
I may be wrong, but the original idea was to double-check the result
with the original expression.

But I'm willing to agree with you. I think we should add transformed
rinfo variable through add_predicate_to_index_quals function. I attached
the diff file to the letter.

diff --git a/src/backend/optimizer/path/indxpath.c
b/src/backend/optimizer/path/indxpath.c
index 3da7ea8ed57..c68ac7008e6 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3463,10 +3463,11 @@ match_orclause_to_indexcol(PlannerInfo *root,
      * rinfo in iclause->rinfo to detect duplicates and recheck the
original
      * clause.
      */
+    RestrictInfo *rinfo_new = make_simple_restrictinfo(root,
+ &saopexpr->xpr);
     iclause = makeNode(IndexClause);
-    iclause->rinfo = rinfo;
-    iclause->indexquals = list_make1(make_simple_restrictinfo(root,
- &saopexpr->xpr));
+    iclause->rinfo = rinfo_new;
+    iclause->indexquals = add_predicate_to_index_quals(index,
list_make1(rinfo_new));
     iclause->lossy = false;
     iclause->indexcol = indexcol;
     iclause->indexcols = NIL;

I figured out comments that you mentioned and found some addition
explanation.

As I understand it, this processing is related to ensuring that the
selectivity of the index is assessed correctly and that there is no
underestimation, which can lead to the selection of a partial index in
the plan. See comment for the add_predicate_to_index_quals function:

* ANDing the index predicate with the explicitly given indexquals produces
 * a more accurate idea of the index's selectivity. *However, we need to be
 * careful not to insert redundant clauses, because
clauselist_selectivity()
 * is easily fooled into computing a too-low selectivity estimate*.  Our
 * approach is to add only the predicate clause(s) that cannot be proven to
 * be implied by the given indexquals.  This successfully handles cases
such
 * as a qual "x = 42" used with a partial index "WHERE x >= 40 AND x < 50".
 * There are many other cases where we won't detect redundancy, leading
to a
 * too-low selectivity estimate, which will bias the system in favor of
using
 * partial indexes where possible.  That is not necessarily bad though.
 *
 * *Note that indexQuals contains RestrictInfo nodes while the indpred
 * does not, so the output list will be mixed.  This is OK for both
 * predicate_implied_by() and clauselist_selectivity()*, but might be
 * problematic if the result were passed to other things.
 */

In those comments that you mentioned, it was written that this problem
of expression redundancy is checked using the predicate_implied_by
function, note that it is called there.

* In some situations (particularly with OR'd index conditions) we may *
have scan_clauses that are not equal to, but are logically implied by, *
the index quals; so we also try a predicate_implied_by() check to see *
if we can discard quals that way. (predicate_implied_by assumes its *
first input contains only immutable functions, so we have to check * that.)

I also figured out more information about loosy variable. First of all,
I tried changing the value of the variable and did not notice any
difference in regression tests. As I understood, our transformation is
completely equivalent, so loosy should be true. But I don't think they
are needed since our expressions are equivalent. I thought for a long
time about an example where this could be a mistake and didn’t come up
with any of them.

--
Regards,
Alena Rybakina
Postgres Professional

Attachment Content-Type Size
or_any.diff text/x-patch 11.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2024-10-28 16:58:08 Re: protocol-level wait-for-LSN
Previous Message Heikki Linnakangas 2024-10-28 16:52:15 Re: Avoid possible overflow (src/port/bsearch_arg.c)