Re: How to speed up pg_trgm / gin index scan

From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Christian Ramseyer <rc(at)networkz(dot)ch>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Oleg Bartunov <obartunov(at)gmail(dot)com>
Subject: Re: How to speed up pg_trgm / gin index scan
Date: 2015-07-28 14:45:44
Message-ID: CAO_YK0U4Dgku89opVkRDT3hp7gZNRr3FCWXcy20oF-NP+uK22w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Jul 28, 2015 at 10:34 AM, Christian Ramseyer <rc(at)networkz(dot)ch> wrote:

>
>
> On 22/06/15 13:51, Christian Ramseyer wrote:
> > Hi
> >
> > I have a pretty large table with syslog messages.
> >
> > It is already partitioned by month, and for a single month I have e.g.
> >
> >
> > DM=# \d+ logs_01
> >
> > Column | Type |
> > --------------+-----------------------------+
> > host | character varying(255) |
> > facility | character varying(10) |
> > priority | character varying(10) |
> > tag | character varying(255) |
> > log_date | timestamp without time zone |
> > program | character varying(255) |
> > msg | text |
> > seq | bigint |
> >
> > Indexes:
> > "logs_01_pkey" PRIMARY KEY, btree (seq)
> > "idx_logs_01_lower_host" btree (lower(host::text)
> varchar_pattern_ops)
> > "logs_01_date_index" btree (log_date)
> > "tridx_logs_01_msg" gin (msg gin_trgm_ops)
> >
> >
> > DM=# select count(*) from logs_01;
> > count
> > ----------
> > 83052864
> >
> >
> > I'd like to provide a fast "like %x%" search on the msg column, hence I
> added a trigram based gin index on it. It is around 60 GB on the 35 GB
> table:
> >
> > DM=# select count(*) from logs_01;
> > count
> > ----------
> > 83052864
> >
> >
> > DM=# \dt+ logs_01
> > List of relations
> > Schema | Name | Type | Owner | Size | Description
> > --------+---------+-------+----------+-------+-------------
> > public | logs_01 | table | postgres | 35 GB |
> >
> > DM=# \di+ tridx_logs_01_msg
> > List of relations
> > Schema | Name | Type | Owner | Table | Size |
> Description
> >
> --------+-------------------+-------+----------+---------+-------+-------------
> > public | tridx_logs_01_msg | index | postgres | logs_01 | 58 GB |
> >
> >
> > A typical query on this table looks like this:
> >
> > explain analyze
> > select log_date, host, msg
> > from logs_01 as log where log.msg like '%192.23.33.177%'
> > and log.log_date >= '2015-1-18 1:45:24'
> > and log.log_date <= '2015-1-19 1:45:24'
> > order by log_date asc offset 200 limit 50;
> >
> >
> > It yields a promising explain that shows that the index is used, but
> unfortunately the bitmap index scan on the GIN index takes quite long (40
> seconds)
> >
> >
> QUERY PLAN
> >
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > Limit (cost=34510.06..34510.06 rows=1 width=195) (actual
> time=42971.002..42971.015 rows=50 loops=1)
> > -> Sort (cost=34509.75..34510.06 rows=124 width=195) (actual
> time=42970.960..42970.990 rows=250 loops=1)
> > Sort Key: log_date
> > Sort Method: top-N heapsort Memory: 152kB
> > -> Bitmap Heap Scan on logs_01 log (cost=34009.21..34505.44
> rows=124 width=195) (actual time=42963.969..42969.725 rows=2472 loops=1)
> > Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND
> (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND
> (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
> > -> BitmapAnd (cost=34009.21..34009.21 rows=124 width=0)
> (actual time=42962.562..42962.562 rows=0 loops=1)
> > -> Bitmap Index Scan on tridx_logs_01_msg
> (cost=0.00..6992.15 rows=8020 width=0) (actual time=42731.145..42731.145
> rows=168489 loops=1)
> > Index Cond: (msg ~~ '%192.23.33.177%'::text)
> > -> Bitmap Index Scan on logs_01_date_index
> (cost=0.00..27016.75 rows=1287939 width=0) (actual time=180.055..180.055
> rows=1173048 loops=1)
> > Index Cond: ((log_date >= '2015-01-18
> 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19
> 01:45:24'::timestamp without time zone))
> > Total runtime: 42971.137 ms
> >
> > (also on http://explain.depesz.com/s/KpaB)
> >
> >
> > Any good ideas on how I could speed this up a bit?
> >
> > I have already tried to throw quite a bunch of memory at the problem:
> >
> > shared_buffers = 64GB
> > work_mem = 16GB
> >
> > but it didn't improve between this and the 32GB shared/ 2GB work GB I
> had before.
> >
> > This is on Postgres 9.1.15 on Linux.
> >
>
>
>
> > Try 9.4 and you'll surprise.
> >
> > 1. GIN has compression
> > 2. GIN has fast scan feature.
> >
> > Oleg
>
>
> Hi Oleg and List
>
> I finally got around to try 9.4, and it is quite fantastic.
>
> Index size went from 58 to now 14 GB:
>
> DM=# \di+ tridx_logs_01_msg
> List of relations
> Schema | Name | Type | Owner | Table | Size |
> Description
>
> --------+-------------------+-------+----------+---------+-------+-------------
> public | tridx_logs_01_msg | index | postgres | logs_01 | 14 GB |
>
>
> And the time for the above query went down to about 20 seconds:
>
>
> DM=# explain analyze
> DM-# select log_date, host, msg
> DM-# from logs_01 as log where log.msg like '%192.23.33.177%'
> DM-# and log.log_date >= '2015-1-18 1:45:24'
> DM-# and log.log_date <= '2015-1-19 1:45:24'
> DM-# order by log_date asc offset 200 limit 50;
>
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=28815.06..28815.06 rows=1 width=194) (actual
> time=19032.099..19032.099 rows=0 loops=1)
> -> Sort (cost=28814.74..28815.06 rows=128 width=194) (actual
> time=19032.093..19032.093 rows=0 loops=1)
> Sort Key: log_date
> Sort Method: quicksort Memory: 25kB
> -> Bitmap Heap Scan on logs_01 log (cost=28298.06..28810.26
> rows=128 width=194) (actual time=19031.992..19031.992 rows=0 loops=1)
> Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND
> (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND
> (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
> -> BitmapAnd (cost=28298.06..28298.06 rows=128 width=0)
> (actual time=19031.983..19031.983 rows=0 loops=1)
> -> Bitmap Index Scan on tridx_logs_01_msg
> (cost=0.00..508.15 rows=8020 width=0) (actual time=18408.121..18408.121
> rows=99 loops=1)
> Index Cond: (msg ~~ '%192.23.33.177%'::text)
> -> Bitmap Index Scan on logs_01_date_index
> (cost=0.00..27789.60 rows=1325303 width=0) (actual time=623.084..623.084
> rows=1173048 loops=1)
> Index Cond: ((log_date >= '2015-01-18
> 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19
> 01:45:24'::timestamp without time zone))
> Planning time: 0.945 ms
> Execution time: 19032.409 ms
> (13 rows)
>
> Great stuff! Sorry Oleg I don't have your original message anymore and
> can't reply into the right place in the thread, so I took the liberty to
> CC: you.
>
> Christian
>
>
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

Christian

You could experiment recompiling pg_trgm commenting out the KEEPONLYALNUM
and/or IGNORECASE definitions if you are looking for exact matches, this
will increase the index size but will make it more selective.

Also, there's a thread around for pg_trgrm 1.2 which will get you even more
boost.

--
Arthur Silva

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Curt Micol 2015-07-28 17:54:57 Logical decoding off of a replica?
Previous Message Merlin Moncure 2015-07-28 14:42:26 Re: How to speed up pg_trgm / gin index scan