Re: Query chooses Bad Index Path

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Valli Annamalai <aishwaryaanns(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Query chooses Bad Index Path
Date: 2022-02-09 16:51:48
Message-ID: f25f15cb-883f-f6a5-52e1-12b37fb890d3@enterprisedb.com
Lists: pgsql-performance

It's a bit annoying that you post the same query over and over again,
starting a new thread every time. Don't do that, please, it's just
confusing, people lose track of information you already provided in
other threads etc.

Now, to the question ...

On 2/9/22 06:37, Valli Annamalai wrote:
> Postgres version: 11.4
>
> Problem:
>     Query choosing Bad Index Path. Details are provided below
>
> ...
>
>
> Doubt
>    1. Why is this query choosing an Index Scan Backward using the table1_pkey
> index even though its cost is high? It could rather choose
>             BITMAP OR
>                   (Index on RECORDID) i.e; table1_idx6
>                   (Index on RELATEDID) i.e; table1_idx7
>
>       Below is the selectivity details from pg_stats table
>         - Recordid has 51969 distinct values. And selectivity
> (most_common_freqs) for recordid = 15842006928391817 is 0.00376667
>         - Relatedid has 82128 distinct values. And selectivity
> (most_common_freqs) for relatedid = 15842006928391817 is 0.0050666
>
> Since the selectivity is low, the planner should logically choose this index,
> which would have improved my query performance here.

Well, the filter condition is much more complex - it's not just
conditions on recordid, but various conditions on other columns, with
both AND and OR. So it's possible the estimate is off, and the optimizer
picks the wrong plan. Try running explain analyze without the LIMIT,
that'll tell you how accurate the estimates are (LIMIT terminates early,
so the actual rowcount is incomplete).
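
For example, something like this (just a sketch - substitute your actual
WHERE clause, I'm only guessing at its shape from the two columns and the
value you mentioned):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
      FROM table1
     WHERE recordid = 15842006928391817
        OR relatedid = 15842006928391817
     -- ... plus the other AND/OR conditions, but no ORDER BY / LIMIT
     ;

If the estimated and actual row counts differ by orders of magnitude,
that's the thing to investigate first.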

The other option is a data distribution issue, as pointed out by Monika
Yadav in the other thread. The optimizer assumes the matching rows are
distributed uniformly in the input relation, but chances are they're
clustered close to the beginning or the end, depending on how you sort it.

Imagine you have 1000000 rows, 1000 of them match the filter, and you
have LIMIT 10. If the matching rows are distributed uniformly, it's
enough to scan 1% of the input, i.e. 10000 rows (because there's one
matching row for every 1000 rows, on average).
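
Roughly speaking, that assumption amounts to:

    rows scanned before finding LIMIT matches
        ~ LIMIT * (total rows / matching rows)
        = 10 * (1000000 / 1000) = 10000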

But let's assume the matching rows are not distributed uniformly and
instead end up at the very end once the input is sorted. Then you'll have
to go through 100% of the input, but the optimizer won't realize that.

This is a known / common issue with LIMIT, unfortunately. The estimated
cost is much lower than it should be, and it's hard to fix.

> I cross-checked the same by removing the primary key from this table, and
> the query now chooses these indexes and responds in 100 ms. Please refer
> to the plan below (after removing the primary key):
>
>

Well, yeah. That's mostly consistent with the data distribution theory.

I'd try two things (rough sketches below):

1) define a covering index, so that the query can do an Index Only Scan

2) define a partial index, moving some of the filter conditions into the
index predicate (not sure if that's possible; it depends on which
parameters of the condition are static)
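
For example (only a sketch - the key column, the INCLUDE list and the
partial predicate all depend on your actual query; "col1", "col2" and
"deleted" below are made-up names):

    -- covering index (sketch): INCLUDE the columns the query returns,
    -- so the heap doesn't have to be visited (Index Only Scan)
    CREATE INDEX table1_covering_idx
        ON table1 (recordid) INCLUDE (col1, col2);

    -- partial index (sketch): only possible if some filter condition
    -- always compares against the same constant, so it can move into
    -- the index predicate
    CREATE INDEX table1_partial_idx
        ON table1 (recordid)
        WHERE deleted = false;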

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
