Re: query against large table not using sensible index to find very small amount of data

From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: "'Andrew W(dot) Gibbs'" <awgibbs(at)awgibbs(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: query against large table not using sensible index to find very small amount of data
Date: 2014-04-08 13:39:41
Message-ID: 0683F5F5A5C7FE419A752A034B4A0B97978B5AD2@sswchi5pmbx2.peak6.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


> Other possibly relevant pieces of information... The entity type
> column has a cardinality in the neighborhood of a couple dozen.
> Meanwhile, for some of the entity types there is a large and ongoing
> number of events, and for other entity types there is a smaller and
> more sporadic number of events. Every now and again a new entity
> type shows up.

With that as the case, I have two questions for you:

1. Why do you have a low cardinality column as the first column in an index?
2. Do you have any queries at all that only use the entity type as the only where clause?

I agree that the planner is probably wrong here, but these choices aren't helping. The low cardinality of the first column causes very large buckets that don't limit results very well at all. Combined with the order-by clause, the planner really wants to walk the date index backwards to find results instead. I would do a couple of things.

First, remove the type/date index. Next, do a count of each type in the table with something like this:

SELECT type_id, count(1)
FROM my_table
GROUP BY 2

Any type that is more than 20% of the table will probably never be useful in an index. At this point, you have a choice. You can create a new index with date and type *in that order* or create a new partial index on date and type that also ignores the top matches. For instance, if you had a type that was 90% of the values, this would be my suggestion:

CREATE INDEX idx_foo_table_date_event_type_part ON foo_table (event_date, event_type)
WHERE event_type != 14;

Or whatever. If the IDs are basically evenly distributed, it won't really matter.

In any case, index order matters. The planner wants to restrict data as quickly as possible. If you provide an order clause, it wants to read the index in that order. Your specified type as the first column disrupts that, so it has to fetch the values first, which is usually more expensive. Even if that's wrong in your particular case, planner stats are not precise enough to know that.

Either way, try moving the indexes around. I can't think of many indexes in our database where I have the low cardinality value as the first column. Databases have an easier time managing many shallow buckets of values, than a few deep ones.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 400 | Chicago IL, 60604
312-676-8870
sthomas(at)optionshouse(dot)com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2014-04-08 13:50:01 Re: [PERFORM] performance drop when function argument is evaluated in WHERE clause
Previous Message Hans Drexler 2014-04-08 13:21:25 Re: batch update query performance