Re: How to speed up pg_trgm / gin index scan

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Christian Ramseyer <rc(at)networkz(dot)ch>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to speed up pg_trgm / gin index scan
Date: 2015-06-22 22:10:04
Message-ID: CAMkU=1zwdiLtk_sQa-7+ijKHxgvaMeZ-Y0KRbiYDBX5upWoYsQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, Jun 22, 2015 at 10:39 AM, Christian Ramseyer <rc(at)networkz(dot)ch> wrote:

> On 22/06/15 19:00, Jeff Janes wrote:
>
> >
> >
> > A typical query on this table looks like this:
> >
> > explain analyze
> > select log_date, host, msg
> > from logs_01 as log where log.msg like '%192.23.33.177%'
> > and log.log_date >= '2015-1-18 1:45:24'
> > and log.log_date <= '2015-1-19 1:45:24'
> > order by log_date asc offset 200 limit 50;
> >
> >
> > I think that trigram indexes are not well-suited to searching IP
> addresses.
> >
> > If the typical query is always an IP address for the LIKE, I think you
> > would want to build an index specifically tailored to that. You could
> > make a function to parse the IP address out of the msg, and then make a
> > functional index, for example. It would require you to write the query
> > differently. Whether it would be a btree index or a gin index would
> > depend on whether you can have more than one IP address in a msg.
> >
>
> Thanks Jeff, but the IP address was mostly an example... I should have
> written this more clearly. Generally the searched string will be a
> random short fragment from msg (ip, hostname, some part of an error
> message etc.).
>

What kind of timings do you get if you search on a hostname or part of an
error message? Is it slow in general, or just when the thing you search on
happens to be an IP address?

Certainly in my hands, trigram index searching with embedded IP addresses
are much worse than on embedded natural language fragments.

>
> It must be matched exactly including all punctuation etc, so trigrams
> look very suitable.
>

I believe the default compilation of pg_trgm ignores all punctuation
(converts them to whitespace) in the index. For a LIKE query, it catches
them when it rechecks the actual tuple in the heap so you still get the
right answer. But if the query is mostly punctuation and short numbers, it
takes much longer to get that correct answer.

Since the time range you query over is narrow and the rows are probably
well-clustered on it, maybe just using the logs_01_date_index would be
faster and then just filtering the table with the LIKE clause:

explain (analyze, buffers)
select log_date, host, msg
from logs_01 as log where (log.msg||'') like '%192.23.33.177%'
and log.log_date >= '2015-1-18 1:45:24'
and log.log_date <= '2015-1-19 1:45:24'
order by log_date asc offset 200 limit 50;

Other options would be making your partitions smaller, or using btree_gist
and make an index on (log_date, msg gist_trgm_ops). Unfortunately those
indexes can be awful slow to build.

Cheers,

Jeff

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Oleg Bartunov 2015-06-22 23:30:01 Re: How to speed up pg_trgm / gin index scan
Previous Message Suresh Raja 2015-06-22 21:10:23 Run analyze on schema