Re: POC, WIP: OR-clause support for indexes

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: jian he <jian(dot)universality(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>, Marcos Pegoraro <marcos(at)f10(dot)com(dot)br>, teodor(at)sigaev(dot)ru, Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Subject: Re: POC, WIP: OR-clause support for indexes
Date: 2024-06-21 22:52:02
Message-ID: 0f50882c-e639-4856-aab6-6ccfec848164@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 21.06.2024 23:35, Robert Haas wrote:
> On Fri, Jun 21, 2024 at 1:05 PM Alena Rybakina
> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> I'm confused, I have seen that we have two threads [1] and [2] about this thread and I haven't found any explanation for how they differ.
>>
>> And I don't understand, why am I not listed as the author of the patch? I was developing the first part of the patch before Andrey came to review it [3] and his first part hasn't changed much since then.
> v25 still lists you as an author (in fact, the first author) but I
> can't say why we have two CommitFest entries. Surely that's a mistake.

Sorry, maybe I was overreacting.

Thank you very much for taking the time to do a detailed review!

On 22.06.2024 00:07, Alexander Korotkov wrote:
> Sorry, I didn't notice that the [1] commitfest entry exists and
> created the [2] commitfest entry. I'm removed [2].
Thank you!

> On the patch itself, I'm really glad we got to a design where this is
> part of planning, not parsing. I'm not sure yet whether we're doing it
> at the right time within the planner, but I think this *might* be
> right, whereas the old way was definitely wrong.

It's hard to tell, but I think it might be one of the good places to
apply transformation. Let me describe a brief conclusion on the two
approaches.

In the first approach, we definitely did not process the extra "OR"
expressions in the first approach, since they were packaged as an Array.
It could lead to the fact that less planning time would be spent on the
optimizer.

Also the selectivity for Array expressions is estimated better, which
could lead to the generation of a more optimal plan, but, to be honest,
this is just an observation from changes in regression tests and, in
general, how the process of calculating the selectivity of a complex
expression works.

 SELECT * FROM check_estimated_rows('SELECT * FROM
functional_dependencies WHERE (a = 1 OR a = 51) AND b = ''1''');

  estimated | actual

 -----------+--------

-        99 |    100

+       100 |    100

 (1 row)

 SELECT * FROM check_estimated_rows('SELECT * FROM
functional_dependencies WHERE (a = 1 OR a = 51) AND (b = ''1'' OR b =
''2'')');

  estimated | actual

 -----------+--------

-        99 |    100

+       100 |    100

 (1 row)

 SELECT * FROM check_estimated_rows('SELECT * FROM
functional_dependencies WHERE (a = 1 OR a = 2 OR a = 51 OR a = 52) AND
(b = ''1'' OR b = ''2'')');

  estimated | actual

 -----------+--------

-       197 |    200

+       200 |    200

In addition, we do not have new equivalence classes, since some “Or”
expressions are not available for their formation. This can result in
reduced memory and time spent generating the query plan, especially in
partitions.

Speaking of the main disadvantages, we do not give the optimizer the
opportunity to generate a plan using BitmapScan, which can lead to the
generation of a suboptimal plan, but in the current approach the same
thing happens [0].

And the second one might be related the lack of generation Equivalence
Classes and generation of useful pathkeysas a result, so we could miss
an optimal plan again. But I haven't caught something like this on
practice. I see we won't have such problems if we apply the
transformation later.

Overall, I have not yet noticed any very different parts from what was
in the first approach: I didn’t see any significant degradation or
improvement, which is good, but so far the main problem with the
degradation of the plan has not yet been solved, that is, we have not
escaped from the main problems.

Andrei mentioned the problem in the second approach about updating
references to expressions in RestrictInfo [1] lists, because the can be
used in different variables during the formation of the query plan. As
the practice of Self-join removal [2] has shown, this can be expensive,
but feasible.

By applying the transformation at the analysis stage in the first
approach, because no links were created, so we did not encounter such
problems, so this approach was more suitable than the others.

[0]
https://www.postgresql.org/message-id/7d5aed92-d4cc-4b76-8ae0-051d182c9eec%40postgrespro.ru

[1]
https://www.postgresql.org/message-id/6850c306-4e9d-40b7-8096-1f3c7d29cd9e%40postgrespro.ru

[2] https://commitfest.postgresql.org/48/5043/

> What exactly is the strategy around OR-clauses with type differences?
> If I'm reading the code correctly, the first loop requires an exact
> opno match, which presumably implies that the constant-type elements
> are of the same type. But then why does the second loop need to use
> coerce_to_common_type?

It needs to transform all similar constants to one type, because some
constants of "OR" expressions can belong others, like the numeric and
int types. Due to the fact that array structure demands that all types
must be belonged to one type, so for this reason we applied this procedure.

You can find the similar strategy in transformAExprIn function, when we
transform "In" list to SaopArray expression. Frankly, initially, I took
it as the main example to make my patch.

> + if (!OperatorIsVisible(entry->opno))
> + namelist = lappend(namelist,
> makeString(get_namespace_name(operform->oprnamespace)));
> +
> + namelist = lappend(namelist, makeString(pstrdup(NameStr(operform->oprname))));
> + ReleaseSysCache(opertup);
> +
> + saopexpr =
> + (ScalarArrayOpExpr *)
> + make_scalar_array_op(NULL,
> + namelist,
> + true,
> + (Node *) entry->expr,
> + (Node *) newa,
> + -1);
>
> I do not think this is acceptable. We should find a way to get the
> right operator into the ScalarArrayOpExpr without translating the OID
> back into a name and then back into an OID.
I don’t really understand the reason why it’s better not to do this. Can
you explain please?

> Also, why is the array built with eval_const_expressions instead of
> something like makeArrayResult? There should be no need for general
> expression evaluation here if we are just dealing with constants.
I'm not ready to answer this question right now, I need time to study
the use of the makeArrayResult function in the optimizer.
> + foreach(lc2, entry->exprs)
> + {
> + RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc2);
> +
> + is_pushed_down = is_pushed_down || rinfo->is_pushed_down;
> + has_clone = has_clone || rinfo->is_pushed_down;
> + security_level = Max(security_level, rinfo->security_level);
> + required_relids = bms_union(required_relids, rinfo->required_relids);
> + incompatible_relids = bms_union(incompatible_relids,
> rinfo->incompatible_relids);
> + outer_relids = bms_union(outer_relids, rinfo->outer_relids);
> + }
> This seems like an extremely bad idea. Smushing together things with
> different security levels (or a bunch of these other properties) seems
> like it will break things. Presumably we wouldn't track these
> properties on a per-RelOptInfo basis unless we needed an accurate idea
> of the property value for each RelOptInfo. If the values are
> guaranteed to match, then it's fine, but then we don't need this code
> to merge possibly-different values. If they're not guaranteed to
> match, then presumably we shouldn't merge into a single OR clause
> unless they do.

We hadn't thought about it before, to be honest. But I agree with you
that this may be one of the reasons not to make the transformation.
> On a related note, it looks to me like the tests focus too much on
> simple cases. It seems like it's mostly testing cases where there are
> no security quals, no weird operator classes, no type mismatches, and
> few joins. In the cases where there are joins, it's an inner join and
> there's no distinction between an ON-qual and a WHERE-qual. I strongly
> suggest adding some test cases for weirder scenarios.
> + /* One more trick: assemble correct clause */
>
> This comment doesn't seem to make much sense. Some other comments
> contain spelling mistakes. The patch should have comments in more
> places explaining key design decisions.
> +extern JumbleState *JumbleExpr(Expr *expr, uint64 *exprId);
>
> This is no longer needed.
Agree.

--
Regards,
Alena Rybakina
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2024-06-22 00:54:30 Re: Add pg_get_acl() function get the ACL for a database object
Previous Message Andreas 'ads' Scherbaum 2024-06-21 22:42:36 Re: call for applications: mentoring program for code contributors