Quick Links

Re: A wrong index choose issue because of inaccurate statistics

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: A wrong index choose issue because of inaccurate statistics
Date:	2020-06-05 06:18:53
Message-ID:	CAApHDvovVWCbeR4v+A4Dkwb=YS_GuJG9OyCm8jZu++cP2xsY_A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, 1 Jun 2020 at 01:24, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> The one-line fix describe the exact idea in my mind:
>
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -730,6 +730,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
>
> cpu_run_cost += cpu_per_tuple * tuples_fetched;
>
> + /*
> + * To make the planner more robust to handle some inaccurate statistics
> + * issue, we will add a extra cost to qpquals so that the less qpquals
> + * the lower cost it has.
> + */
> + cpu_run_cost += 0.01 * list_length(qpquals);
> +
>
> This change do fix the issue above, but will it make some other cases worse? My
> answer is no based on my current knowledge, and this is most important place
> I want to get advised. The mainly impact I can figure out is: it not only
> change the cost difference between (a, b) and (a, c) it also cause the cost
> difference between Index scan on (a, c) and seq scan. However the
> cost different between index scan and seq scan are huge by practice so
> I don't think this impact is harmful.

Didn't that idea already get shot down in the final paragraph on [1]?

I understand that you wish to increase the cost by some seemingly
innocent constant to fix your problem case. Here are my thoughts
about that: Telling lies is not really that easy to pull off. Bad
liers think it's easy and good ones know it's hard. The problem is
that the lies can start small, but then at some point the future you
must fashion some more lies to account for your initial lie. Rinse and
repeat that a few times and before you know it, your course is set
well away from the point of truth. I feel the part about "rinse and
repeat" applies reasonably well to how join costing works. The lie is
likely to be amplified as the join level gets deeper.

I think you need to think of a more generic solution and propose that
instead. There are plenty of other quirks in the planner that can
cause suffering due to inaccurate or non-existing statistics. For
example, due to how we multiply individual selectivity estimates,
having a few correlated columns in a join condition can cause the
number of join rows to be underestimated. Subsequent joins can then
end up making bad choices on which join operator to use based on those
inaccurate row estimates. There's also a problem with WHERE <x> ORDER
BY col LIMIT n; sometimes choosing an index that provides pre-sorted
input to the ORDER BY but cannot use <x> as an indexable condition.
We don't record any stats to make better choices there, maybe we
should, but either way, we're taking a bit risk there as all the rows
matching <x> might be right at the end of the index and we might need
to scan the entire thing before hitting the LIMIT. For now, we just
assume completely even distribution of rows. i.e. If there are 50 rows
estimated in the path and the limit is for 5 rows, then we'll assume
we need to read 10% of those before finding all the ones we need. In
reality, everything matching <x> might be 95% through the index and we
could end up reading 100% of rows. That particular problem is not just
caused by the uneven distribution of rows in the index, but also from
selectivity underestimation.

I'd more recently wondered if we shouldn't have some sort of "risk"
factor to the cost model. I really don't have ideas on how exactly we
would calculate the risk factor in all cases, but in this case, say
the risk factor always starts out as 1. We could multiply that risk
factor by some >1 const each time we find another index filter qual.
add_path() can prefer lower risk paths when the costs are similar.
Unsure what the exact add_path logic would be. Perhaps a GUC would
need to assist with the decision there. Likewise, with
NestedLoopPaths which have a large number of join quals, the risk
factor could go up a bit with those so that we take a stronger
consideration for hash or merge joins instead.

Anyway, it's pretty much a large research subject which would take a
lot of work to iron out even just the design. It's likely not a
perfect idea, but I think it has a bit more merit that trying to
introduce lies to the cost modal to account for a single case where
there is a problem.

David

[1] https://www.postgresql.org/message-id/20200529001602.eu7vuiouuuiclpgb%40development

In response to

A wrong index choose issue because of inaccurate statistics at 2020-05-31 13:24:05 from Andy Fan

Responses

Re: A wrong index choose issue because of inaccurate statistics at 2020-06-05 06:30:05 from Pavel Stehule
Re: A wrong index choose issue because of inaccurate statistics at 2020-06-09 03:33:03 from Andy Fan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2020-06-05 06:22:28	Re: Removal of currtid()/currtid2() and some table AM cleanup
Previous Message	Amit Kapila	2020-06-05 06:07:39	Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions