Re: Guidelines on best indexing strategy for varying searches on 20+ columns

From: Niels Kristian Schjødt <nielskristian(at)autouncle(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org list" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Guidelines on best indexing strategy for varying searches on 20+ columns
Date: 2014-06-30 07:33:10
Message-ID: CF650FDA-3106-401F-9CD6-813FF41A4E9D@autouncle.com
Lists: pgsql-performance

Thanks for your suggestions, very useful. See comments inline:

On 25/06/2014 at 23.48, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Wed, Jun 25, 2014 at 3:48 AM, Niels Kristian Schjødt
> <nielskristian(at)autouncle(dot)com> wrote:
>> Hi,
>> I’m running a search engine for cars. It’s backed by a postgresql 9.3 installation.
>>
>> Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.
>>
>> The problem:
>>
>> The table containing the cars holds around 1.5 million rows. People searching for cars use different criteria: some search by brand/model, some by year, some by mileage, some by price, some by special equipment, etc., and they often combine a whole bunch of criteria. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create for the cars). By default we order by our own generated score.
>>
>> What I’ve done so far:
>>
>> I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.
>>
>> Questions:
>>
>> 1. How would you go about analyzing and determining what columns should be indexed, and how?
>
> mainly frequency of access.
>
>> 2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the usage and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single column indexes, etc.)
>
> don't make 20 indexes. consider installing pg_trgm (for optimized
> LIKE searching) or hstore (for optimized key/value searching) and then
> using GIST/GIN for multiple attribute search. with 9.4 we have
> another fancy technique to explore: jsonb searching via GIST/GIN.

Interesting, do you have any good resources on this approach?
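For readers following along: a sketch of the multi-attribute GIN approach Merlin mentions, using the btree_gin extension so one GIN index can serve equality searches on several ordinary scalar columns at once (table and column names here are hypothetical, loosely modeled on the schema described above):

```sql
-- btree_gin teaches GIN to index plain scalar types (text, int, ...),
-- so a single index can back many different combinations of
-- equality predicates.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Partial, as suggested earlier: only index cars actually for sale.
CREATE INDEX cars_attrs_gin ON cars
  USING gin (brand, model, body_type, fuel_type)
  WHERE sales_state = 'for_sale';

-- The planner can then use the same index whichever subset of these
-- columns a given search happens to combine, e.g.:
-- SELECT * FROM cars
--  WHERE brand = 'Audi' AND fuel_type = 'diesel'
--    AND sales_state = 'for_sale';
```

This is a sketch, not the thread's verified solution; GIN handles the equality criteria, while range criteria like price or mileage still need separate treatment, as Merlin notes further down.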
>
>> 3. I expect that it does not make sense to index all columns?
>
> well, maybe. if you only ever search one column at a time, then it
> might make sense. but if you need to search arbitrary criteria and
> frequently combine a large number, then no -- particularly if your
> dataset is very large and individual criteria are not very selective.

So, just to clarify: I’m often combining a large number of search criteria, and the individual criteria are often not very selective. In that case, are you arguing for or against indexing all columns? :-)
>
>> 4. I expect it does not make sense to index boolean columns?
>
> in general, no. an important exception is if you are only interested
> in true or false and the number of records that have that interesting
> value is tiny relative to the size of the table. in that case, a
> partial index can be used for massive optimization.

Thanks, I hadn’t thought of using partial indexes as an option here.
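For reference, the exception Merlin describes might look like this (the equipment flag and column names are hypothetical):

```sql
-- If only a small fraction of cars have the flag set, a partial index
-- containing just those rows stays tiny and makes lookups very cheap,
-- while queries on the common value simply don't use it.
CREATE INDEX cars_with_towbar_idx ON cars (id)
  WHERE has_towbar AND sales_state = 'for_sale';
```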
>
>> 5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?
>
> Only if you search those 5 columns together a significant portion of the time.
>
>> 6. Would it be a good idea to have all indexes sorted by my default sorting?
>
> index order rarely matters. if you always search values backwards and
> the table is very large you may want to consider it. unfortunately
> this often doesn't work for composite indexes so sometimes we must
> explore the old school technique of reversing the value.
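As a sketch of the two options Merlin mentions (names hypothetical): modern PostgreSQL lets you declare the sort order directly in the index, and the "reverse the value" trick indexes a negated expression so an ordinary ascending composite index yields a descending sort on one column:

```sql
-- Declare the order in the index itself (supported since 8.3):
CREATE INDEX cars_score_desc_idx ON cars (score DESC)
  WHERE sales_state = 'for_sale';

-- The old-school alternative: index -price, so scanning the composite
-- index in its normal ascending order returns rows with price descending
-- within each brand. Queries must then ORDER BY brand, (-price).
CREATE INDEX cars_brand_price_rev_idx ON cars (brand, (-price));
```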
>
>> 7. Do you have any experience with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory, etc.)?
>
> as noted above, fancy indexing is the first place to look. start
> with pg_trgm (for LIKE optimization), hstore, and the new json stuff.
> the big limitation you will hit is that most index strategies, at
> least for the prepackaged stuff, will support '=' or partial string
> matching (particularly with pg_trgm LIKE), but not > or <: for range
> operations you have to post-process the search or try to work the
> index from another angle.
>
> merlin
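For completeness, the pg_trgm setup Merlin recommends might look like this (column name hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram GIN index accelerates LIKE/ILIKE even with a leading
-- wildcard, which a plain btree index cannot help with:
CREATE INDEX cars_model_trgm_idx ON cars
  USING gin (model gin_trgm_ops);

-- e.g. WHERE model ILIKE '%quattro%' can now use the index.
```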
