Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitry Astapov <dastapov(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Date: 2022-02-17 22:16:31
Message-ID: CA+TgmoYYMvZm1EnHrD8sBovZ=x2Piy9_yew3J1LDyCW39Ey-HQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 17, 2022 at 4:17 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> IMHO the whole problem is we're unable to estimate the join clause as a
> conditional probability, i.e.
>
> P(A.x = B.x | (A.x < 42) & (B.x < 42))
>
> so maybe instead of trying to generate additional RelOptInfo items we
> should think about improving that. The extra RelOptInfos don't really
> solve this, because even if you decide to join A|x<42 to B|x<42 it does
> nothing to improve the join clause estimate.

I guess I hadn't considered that angle. I think the extra RelOptInfos
(or whatever) actually do solve a problem, because enforcing a
high-selectivity join qual against both sides is potentially quite
wasteful, and you need some way to decide whether to do it on one
side, the other, or both. But it's also true that I was wrong to
assume independence ... and if we could avoid assuming that, then the
join selectivity would work itself out without any of the machinery
that I just proposed.

> It actually deals with a more general form of this case, because the
> clauses don't need to reference the same attribute - so for example this
> would work too, assuming there is extended stats object on the columns
> on each side:
>
> P(A.c = B.d | (A.e < 42) & (B.f < 42))

That'd be cool.

> Not sure. In my experience queries with both a join clause and other
> clauses referencing the same attribute are pretty rare. But I agree if
> we can do the expensive stuff only when actually needed, with no cost in
> the 99.999% other cases, I don't see why not. Of course, code complexity
> is a cost too.

Right. I mean, we could have a planner GUC to control whether the
optimization is used even in cases where we see that it's possible.
But Tom keeps arguing that it is possible in many queries and would
benefit few queries, and I'm not seeing why that should be so. I think
it's likely to benefit many of the queries to which it applies.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-02-17 22:18:58 Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM
Previous Message Robert Haas 2022-02-17 21:36:07 Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM