Re: Guidelines on best indexing strategy for varying searches on 20+ columns

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Niels Kristian Schjødt <nielskristian(at)autouncle(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org list" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Guidelines on best indexing strategy for varying searches on 20+ columns
Date: 2014-06-30 18:04:36
Message-ID: CAMkU=1wT-HEzLR=G9ENaBabg17qCwKMQZA=21_+ZsDRdXGva0w@mail.gmail.com
Lists: pgsql-performance

On Wed, Jun 25, 2014 at 1:48 AM, Niels Kristian Schjødt
<nielskristian(at)autouncle(dot)com> wrote:
> Hi,
> I’m running a search engine for cars. It’s backed by a PostgreSQL 9.3 installation.
>
> Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.
>
> The problem:
>
> The table containing the cars holds around 1.5 million rows. People searching for cars use different criteria: some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. etc. - and often they combine a whole bunch of criteria together. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.
>
> What I’ve done so far:
>
> I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.

I'd probably partition the data on whether it is for sale, and then
search only the for-sale partition.
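
A rough sketch of what that could look like on 9.3 with inheritance
(table and column names here are just placeholders for whatever your
schema actually uses):

    -- the parent "cars" stays empty; the children hold the rows
    CREATE TABLE cars_for_sale (
        CHECK (sales_state = 'for_sale')
    ) INHERITS (cars);

    CREATE TABLE cars_off_market (
        CHECK (sales_state <> 'for_sale')
    ) INHERITS (cars);

With constraint_exclusion = partition (the default), a query that
filters on sales_state = 'for_sale' never touches the historical
child.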

>
> Questions:
>
> 1. How would you go about analyzing and determining what columns should be indexed, and how?

I'd start out with intuition about which columns are likely to be used
most often, and in a selective way. And follow up by logging slow
queries so they can be dissected at leisure.
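
For the logging part, a minimal setup is log_min_duration_statement;
the 250 ms threshold and the database name below are just placeholders:

    -- log every statement that runs longer than 250 ms,
    -- set here per database (or globally in postgresql.conf)
    ALTER DATABASE cardb SET log_min_duration_statement = '250ms';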

> 2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single-column indexes etc. etc.)

There is no magic index. Based on your description, you are going to
be seq scanning your table a lot. Focus on making it as small as
possible, by partitioning it horizontally so that the not-for-sale
entries are hived off to a historical table, and by partitioning it
vertically so that large columns rarely used in the WHERE clause
are moved to a separate table. (Ideally you would tell PostgreSQL to
aggressively TOAST those columns instead, but there is no knob with
which to do that.)
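
A hedged sketch of that vertical split (again with made-up names);
the wide, rarely-filtered columns move to a side table that is only
joined back in when a result page actually needs them:

    CREATE TABLE car_details (
        car_id      bigint PRIMARY KEY REFERENCES cars (id),
        description text,  -- long free text, never in the WHERE clause
        photos_meta text   -- likewise large and rarely filtered on
    );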

> 3. I expect that it does not make sense to index all columns?

You mean individually, or jointly? Either way, probably not.

> 4. I expect it does not make sense to index boolean columns?

In some cases it can, for example if the data distribution is very
lopsided and the rarer value is frequently searched for.
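
A partial index is often the better fit there. For instance, assuming
a boolean equipment flag that is true for only a small fraction of
rows (column names invented):

    -- only the rare TRUE rows are indexed, so the index stays tiny
    CREATE INDEX cars_towbar_idx ON cars (price)
        WHERE has_towbar AND sales_state = 'for_sale';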

> 5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?

How often are the columns specified together? If they are completely
independent it probably makes little sense to index them together.
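
As an illustration (names assumed): pair up columns that arrive
together, and leave truly independent ones on their own, since the
planner can still combine single-column indexes with a BitmapAnd:

    -- brand and model are almost always specified together
    CREATE INDEX cars_brand_model_idx ON cars (brand, model);

    -- year and mileage vary independently; separate indexes the
    -- planner can bitmap-AND when both happen to be given
    CREATE INDEX cars_year_idx    ON cars (year);
    CREATE INDEX cars_mileage_idx ON cars (mileage);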

> 6. Would it be a good idea to have all indexes sorted by my default sorting?

You don't get to choose. A btree index is sorted by the columns
specified in the index, according to the operators specified (or
defaulted). If you mean that you want to add the default sort column
as the leading column in each index, though, that actually might make
sense.
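
For example, with your generated score as the default (descending)
sort and an assumed brand column:

    -- walking this index already yields rows in the default output
    -- order, so a query filtering on brand and ending in
    -- ORDER BY score DESC LIMIT 20 can stop at the 20th match
    CREATE INDEX cars_score_brand_idx ON cars (score DESC, brand);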

> 7. Do you have some experience with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?

If your queries are as unstructured as you imply, I'd forget about
indexes for the most part, as you will have a hard time finding ones
that work. Concentrate on making seq scans as fast as possible. If
most of your queries end in something like "ORDER BY price LIMIT 10",
then concentrate on index scans over price. You will probably want to
include heuristics in your UI such that if people configure queries
that would download half your database, you disallow that. You will
probably find that 90% of the workload comes from people who are just
playing around with your website and don't actually intend to do
business with you.
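
Concretely, the price-ordered case could be served by something like
this, partial on the sales state as with your existing indexes:

    -- supports ... WHERE sales_state = 'for_sale' ORDER BY price
    -- LIMIT 10 without a sort: the scan stops after ten rows
    CREATE INDEX cars_price_for_sale_idx ON cars (price)
        WHERE sales_state = 'for_sale';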

Cheers,

Jeff
