From: | "John Surcombe" <John(dot)Surcombe(at)digimap(dot)gg> |
---|---|
To: | <pgsql-performance(at)postgresql(dot)org> |
Cc: | "Kieran Senior" <Kieran(dot)Senior(at)digimap(dot)gg> |
Subject: | Re: Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1 |
Date: | 2011-03-14 09:37:40 |
Message-ID: | C0B87DEF8EF4F94C9D8FBD3151B627B26C0C47@digimapserver.digimap.local |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
> >> When we 'EXPLAIN' this query, PostgreSQL says it is using the index
> >> idx_receiveddatetime. The way the application is designed means
that
> >> in virtually all cases the query will have to scan a very long way
> >> into idx_receiveddatetime to find the first record where userid =
> 311369000.
> >> If however we delete the idx_receiveddatetime index, the query uses
> >> the idx_userid_receiveddatetime index, and the query only takes a
few
> >> milliseconds.
>
> > That's just bizarre ... it knows the index is applicable, and the
cost
> > estimates clearly favor the better index, so why did it pick the
worse
> > one?
>
> No, scratch that, I misread the plans. It *is* picking the plan it
thinks has
> lower cost; it's just a mistaken cost estimate. It's strange though
that the less
> selective indexscan is getting a lower cost estimate. I wonder
whether your
> table is (almost) perfectly ordered by receiveddatetime, such that the
one-
> column index has correlation close to 1.0. That could possibly lower
the cost
> estimate to the point where it'd appear to dominate the other index.
It'd be
> useful to see the pg_stats.correlation value for both the userid and
> receiveddatetime columns.
Yes, the table is indeed nearly perfectly ordered by receiveddatetime
(correlation 0.998479). correlation on userid is -0.065556. n_distinct
on userid is also low: 1097.
Is the problem perhaps something like the following: PostgreSQL is
thinking that because there are not many userids and there is low
correlation, that if it just scans the table from the top in date order,
this will be cheap (because receiveddatetime correlation is high so it
won't have to seek randomly), and it won't have to scan very far before
it finds the first row with a matching userid.
The problem is though that in our application the userids are not
scattered randomly. There are small regions of the table where a
particular userid appears frequently, interspersed with much larger
regions (perhaps millions or tens of millions of rows) where it doesn't
appear at all. So in fact the planner's preferred solution is often
pathologically bad.
Is there a solution?
From | Date | Subject | |
---|---|---|---|
Next Message | Marti Raudsepp | 2011-03-14 10:17:34 | Re: Tuning massive UPDATES and GROUP BY's? |
Previous Message | Julius Tuskenis | 2011-03-14 08:46:07 | Re: unexpected stable function behavior |