Re: Less selective index chosen unexpectedly

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Less selective index chosen unexpectedly
Date: 2021-05-20 01:06:31
Message-ID: CAApHDvp8TYccf9mN_7ZAogh2jZYeNA-8UPCQZLBhuef1z=rhkg@mail.gmail.com
Lists: pgsql-bugs

On Thu, 20 May 2021 at 02:54, James Coleman <jtc331(at)gmail(dot)com> wrote:
> What I'm interested in here is that, at least in this case (and I'm
> theorizing it's more generalizable, but that's why I wanted to get
> discussion on it) the risk of getting selectivity wrong is quite high.
> And it seems to me that a more specific index (though larger and
> costlier to descend) is generally lower risk, albeit higher cost in
> best cases for the less specific index. In this particular example it
> wouldn't even matter which permutation of column ordering you have in
> the more specific index -- it's always guaranteed to return the first
> row that's found by descending the tree (excluding vacuumable rows).

Unfortunately, we don't currently include risk in our cost model.
Today's planner is fairly happy to dance blindfolded at the top of
tall cliffs, and it falls off them a little too often, in cases such
as the one you describe. It would be good to one day give the planner
some proper lessons in safety.

I have wondered whether we should add some sort of risk factor to our
cost model. I mentioned something about it in [1], but I don't yet
have a great idea of how the "risk" cost value would be calculated;
that could be hard. It might make sense to bump up the risk factor
for some paths, but as we go through and add a risk level to more
paths, exactly what we set the risk value to becomes much more
important. It would be easy to say that the risk value should be
whatever the total_cost would have to be multiplied by to reach the
cost of the worst possible case. However, if the worst possible case
is exceedingly unlikely, then maybe that's not a great choice.
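
For illustration only (these numbers are invented): if a path's
estimated total_cost is 100 but the worst realistic outcome for that
path would cost 50,000, then under that definition its risk value
would be 50,000 / 100 = 500. Whether charging for 500 is sensible
depends entirely on how likely that worst case is to happen, which is
exactly the part that seems hard to calculate.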

The planner takes plenty of other risks too; the situation you
describe here is just one of many. Another example: when we join on,
say, 3 or more conditions, those conditions could be correlated. We
currently always assume they're independent and simply multiply the
individual selectivities, so we can end up thinking the join will
produce far fewer rows than it actually will. Problems then arise in
subsequent joins. We may perform a later join with a nested loop,
thinking that only 1 row will be produced by the first join, and that
is likely to turn bad if the innermost join instead produces millions
of rows. Unfortunately, extended statistics don't yet help us with
join selectivities.
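
To make that concrete, here is a minimal sketch (the tables, columns
and row counts are made up purely for illustration) where three
perfectly correlated join conditions make the estimate come out far
too low:

  CREATE TABLE t1 (a int, b int, c int);
  CREATE TABLE t2 (a int, b int, c int);
  -- a, b and c always hold the same value, so the three join
  -- clauses below are perfectly correlated
  INSERT INTO t1 SELECT i % 100, i % 100, i % 100
    FROM generate_series(1, 100000) i;
  INSERT INTO t2 SELECT i % 100, i % 100, i % 100
    FROM generate_series(1, 100000) i;
  ANALYZE t1;
  ANALYZE t2;

  -- each clause gets a selectivity of about 1/100 and the three are
  -- multiplied as if independent, so the join is estimated at
  -- roughly 10,000 rows when it actually produces around 100 million
  EXPLAIN SELECT *
    FROM t1 JOIN t2
      ON t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c;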

We also take risks with LIMIT clauses, as we assume the rows that
survive the filter are evenly distributed through the result. That
ends badly when the rows we want are right at the end.
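
Again, a hypothetical sketch (table and data made up) of how that can
go wrong:

  CREATE TABLE events (id serial PRIMARY KEY, flag boolean);
  -- only the final 1% of rows have flag = true
  INSERT INTO events (flag) SELECT i > 990000
    FROM generate_series(1, 1000000) i;
  ANALYZE events;

  -- the planner assumes the ~1% of matching rows are spread evenly
  -- through the index on id, so it expects this ordered index scan
  -- to stop after reading roughly 100 rows; in reality it must walk
  -- past about 990,000 non-matching rows before finding the first
  -- match
  EXPLAIN SELECT * FROM events WHERE flag ORDER BY id LIMIT 1;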

David

[1] https://www.postgresql.org/message-id/CAApHDvqBoYU8aES4a0t-J15wk1wPMFJDHcyafyfHj7JqJ+u9wg@mail.gmail.com
