insert and query performance on big string table with pg_trgm

From: Matthew Hall <mhall(at)mhcomputing(dot)net>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: insert and query performance on big string table with pg_trgm
Date: 2017-11-20 22:54:01
Message-ID: F5E9C927-1F84-488B-BD02-01DAFDD369D9@mhcomputing.net
Lists: pgsql-performance

Hello PGSQL experts,

I've used your great database pretty heavily for the last 4 years, and during
that time it's helped me solve an amazingly wide variety of data challenges.
Last week, I finally ran into something weird enough that I couldn't figure it
out by myself. I'm using a self-compiled copy of the latest 10.x stable branch
(PostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609, 64-bit) on Ubuntu 16.04 LTS, doing
inserts with psycopg2 and queries (so far) with psql for testing, with JDBC to
come later.

I have this great big table of strings, about 180 million rows. I want to be
able to search this table for substring matches and for overall string
similarity against various inputs (an ideal use case for pg_trgm, from what I
can see in the docs and the research articles on this kind of indexing).
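To make that concrete, the two query shapes I need look roughly like this
('keyword' is just a placeholder input):

-- substring match (case-insensitive), which a trigram index can help with
SELECT * FROM huge_table WHERE value ILIKE '%keyword%';

-- overall similarity via the pg_trgm % operator, best matches first
SELECT value, similarity(value, 'keyword') AS sml
  FROM huge_table
 WHERE value % 'keyword'
 ORDER BY sml DESC;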

I need a unique b-tree index on the strings, to prevent duplicates both during
the initial load and when new strings are added in the future, plus the
{gin,gist}_trgm_ops index to speed up the matching. I couldn't fully understand
from the docs whether my use case is a better fit for GIN or for GiST. Some
parts of the docs implied GiST would be faster, but only for fewer than 100K
entries, beyond which GIN would be faster. I am hoping someone could comment.
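
Concretely, the index definitions I'm choosing between look roughly like this
(the GiST variant is hypothetical, since I went with GIN for now):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- unique b-tree to reject duplicate strings
CREATE UNIQUE INDEX huge_table_value_idx ON huge_table (value);

-- option A: GIN trigram index (what I actually built)
CREATE INDEX huge_table_value_trgm ON huge_table USING gin (value gin_trgm_ops);

-- option B: GiST trigram index (the alternative I'm asking about)
-- CREATE INDEX huge_table_value_trgm ON huge_table USING gist (value gist_trgm_ops);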

Here is the table:

                                      Unlogged table "public.huge_table"
   Column    |           Type           | Collation | Nullable |                    Default
-------------+--------------------------+-----------+----------+-----------------------------------------------
 id          | bigint                   |           | not null | nextval('huge_table_id_seq'::regclass)
 inserted_ts | timestamp with time zone |           |          | transaction_timestamp()
 value       | character varying        |           |          |
Indexes:
    "huge_table_pkey" PRIMARY KEY, btree (id)
    "huge_table_value_idx" UNIQUE, btree (value)
    "huge_table_value_trgm" gin (value gin_trgm_ops)

I managed to load the table initially in about 9 hours, after doing the
optimizations below, based on various documentation; the main settings are also
collected into a postgresql.conf sketch after this list (the server is an
8-core Xeon E5504, 16 GB RAM, 4 Hitachi 1 TB 7200 RPM drives in a RAID 5 via
Linux MD):

* compiled latest 10.x stable code branch from Git
* unlogged table (risky but made a big difference)
* shared_buffers 6 GB
* work_mem 32 MB
* maintenance_work_mem 512 MB
* effective_cache_size 10 GB
* synchronous_commit off
* wal_buffers 16 MB
* max_wal_size 4 GB
* checkpoint_completion_target 0.9
* auto_explain, and slow log for >= 1000 msecs (to debug this)
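
Here are those settings again as they'd appear in postgresql.conf (from memory,
and the last three lines are roughly how I wired up auto_explain and the
slow-query log):

shared_buffers = 6GB
work_mem = 32MB
maintenance_work_mem = 512MB
effective_cache_size = 10GB
synchronous_commit = off
wal_buffers = 16MB
max_wal_size = 4GB
checkpoint_completion_target = 0.9
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 1000
log_min_duration_statement = 1000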

I'm noticing that insert performance starts slipping quite a bit as the data
grows. It starts out fast, < 1 sec per batch of 5000, but eventually slows to
5-10 sec per batch, sometimes randomly more.

In this example, it was just starting to slow, taking 4 secs to insert 5000
values:

2017-11-18 08:10:21 UTC [29578-11250] arceo(at)osint LOG: duration: 4034.901 ms plan:
    Query Text: INSERT INTO huge_table (value) VALUES
        ('value1'),
        ... 4998 more values ...
        ('value5000')
        ON CONFLICT (value) DO NOTHING
    Insert on huge_table  (cost=0.00..87.50 rows=5000 width=48)
      Conflict Resolution: NOTHING
      Conflict Arbiter Indexes: huge_table_value_idx
      ->  Values Scan on "*VALUES*"  (cost=0.00..87.50 rows=5000 width=48)

When it's inserting, oddly enough, postgres seems mostly CPU limited, where I
would personally have expected more of an I/O limit, and memory isn't
over-utilized either, so it makes me wonder if I missed something:

KiB Mem : 16232816 total,   159196 free,   487392 used, 15586228 buff/cache
KiB Swap: 93702144 total, 93382320 free,   319816 used.  8714944 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29578 postgres  20   0 6575672 6.149g 6.139g R  86.0 39.7  45:24.97 postgres
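
In case it's useful for the CPU-vs-I/O question, the wait events I'd check
while a slow batch is running would be something like this (I haven't collected
that output systematically yet):

SELECT pid, state, wait_event_type, wait_event
  FROM pg_stat_activity
 WHERE pid = 29578;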

As for queries, a simple query like the one below seems to take around 30
seconds to a minute. My query volume is not crazy high, but I am hoping to get
this down to less than 30 seconds, because otherwise other stuff layered above
this code will start to time out:

osint=# explain analyze select * from huge_table where value ilike '%keyword%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on huge_table  (cost=273.44..61690.09 rows=16702 width=33) (actual time=2897.847..58438.545 rows=16423 loops=1)
   Recheck Cond: ((value)::text ~~* '%keyword%'::text)
   Rows Removed by Index Recheck: 3
   Heap Blocks: exact=5954
   ->  Bitmap Index Scan on huge_table_value_trgm  (cost=0.00..269.26 rows=16702 width=0) (actual time=2888.846..2888.846 rows=16434 loops=1)
         Index Cond: ((value)::text ~~* '%keyword%'::text)
 Planning time: 0.252 ms
 Execution time: 58442.413 ms
(8 rows)
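
If it would help, I can also re-run this with buffer counts and grab the
relation sizes; I haven't captured that output yet, but it would be along these
lines:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM huge_table WHERE value ILIKE '%keyword%';

SELECT relname, pg_size_pretty(pg_relation_size(oid))
  FROM pg_class
 WHERE relname IN ('huge_table', 'huge_table_value_idx', 'huge_table_value_trgm');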

Thanks for reading this, and for any recommendations you can share.

Sincerely,
Matthew Hall
