pg_trgm word_similarity query does not use index for input strings longer than 8 characters

From: pgsql-performance(at)jhacker(dot)de
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: pg_trgm word_similarity query does not use index for input strings longer than 8 characters
Date: 2021-11-30 21:38:32
Message-ID: e76c242b-2d78-813f-2f95-5ba5c0bf3b5c@mailbox.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

recently I wrote a query that provides suggestions from a Postgres table.
It should be able to work despite smaller typos and thus I chose to use
the pg_trgm extension (https://www.postgresql.org/docs/current/pgtrgm.html)
When measuring the performance, I observed great differences in the
query time, depending on the input string.
Analysis showed that Postgres sometimes used the created indexes and
sometimes it didn't, even though it would provide a considerable speedup.

In the included test case the degradation occurs for all input strings
of length 8 or longer, for shorter strings the index is used.

My questions:
Why doesn't the query planner choose to use the index?
Can I make Postgres use the index, and if so, how?
I understand that trying to outsmart the planner is generally a bad
idea. Maybe the query can be rewritten or there are some parameters that
could be tweaked.

## Setup Information

Hardware: Intel i5-8250U, 8GB RAM, encrypted SSD, no RAID
$ uname -a
Linux 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC
2021 x86_64 x86_64 x86_64 GNU/Linux

Software:
OS: Ubuntu 20.04
Postgres: PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110,
64-bit
The Postgres docker image was used.
Docker: Docker version 20.10.5, build 55c4c88
Image used: postgres:14.1

Configuration:
The config file was not changed.
name | current_setting | source
----------------------------+--------------------+----------------------
application_name | psql | client
client_encoding | UTF8 | client
DateStyle | ISO, MDY | configuration file
default_text_search_config | pg_catalog.english | configuration file
dynamic_shared_memory_type | posix | configuration file
enable_seqscan | off | session
lc_messages | en_US.utf8 | configuration file
lc_monetary | en_US.utf8 | configuration file
lc_numeric | en_US.utf8 | configuration file
lc_time | en_US.utf8 | configuration file
listen_addresses | * | configuration file
log_timezone | Etc/UTC | configuration file
max_connections | 100 | configuration file
max_stack_depth | 2MB | environment variable
max_wal_size | 1GB | configuration file
min_wal_size | 80MB | configuration file
shared_buffers | 128MB | configuration file
TimeZone | Etc/UTC | configuration file

## Test Case
The test case creates a simple table and fills it with 10000 identical
entries.
The query is executed twice with an 8 character string, once with
sequential scans enabled, and once with sequential scans disabled.
The first query does not use the index, even if the second query shows
that it would be much faster.

docker run --name postgres -e POSTGRES_PASSWORD=postgres -d postgres:14.1
docker exec -it postgres bash
psql -U postgres

CREATE EXTENSION pg_trgm;

CREATE TABLE song (
artist varchar(20),
title varchar(20)
);

INSERT INTO song (artist, title)
SELECT 'artist','title'
FROM generate_series(1,10000);

CREATE INDEX artist_trgm ON song USING GIN (artist gin_trgm_ops);
CREATE INDEX title_trgm ON song USING GIN (title gin_trgm_ops);

-- Tips from https://wiki.postgresql.org/wiki/Slow_Query_Questions
ANALYZE;
VACUUM;
REINDEX TABLE song;

\set query '12345678'

-- This query is slow
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query')
;

set enable_seqscan=off;

-- This query is fast
EXPLAIN ANALYZE
SELECT song.artist, song.title
FROM song
WHERE (song.artist %> :'query' OR song.title %> :'query')
;

## Additional Test Case Info

Schemata:
Table "public.song"
Column | Type | Collation | Nullable | Default |
Storage | Compression | Stats target | Description
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
artist | character varying(20) | | | |
extended | | |
title | character varying(20) | | | |
extended | | |
Indexes:
"artist_trgm" gin (artist gin_trgm_ops)
"title_trgm" gin (title gin_trgm_ops)
Access method: heap
Index "public.artist_trgm"
Column | Type | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
artist | integer | yes | artist | plain |
gin, for table "public.song"
Index "public.title_trgm"
Column | Type | Key? | Definition | Storage | Stats target
--------+---------+------+------------+---------+--------------
title | integer | yes | title | plain |
gin, for table "public.song"

Table Metadata:
postgres=# SELECT relname, relpages, reltuples, relallvisible, relkind,
relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class
WHERE relname='song';
relname | relpages | reltuples | relallvisible | relkind | relnatts |
relhassubclass | reloptions | pg_table_size
---------+----------+-----------+---------------+---------+----------+----------------+------------+---------------
song | 55 | 10000 | 55 | r | 2 |
f | | 483328

EXPLAIN ANALYZE of the "slow" query
QUERY PLAN
---------------------------------------------------------------------------------------------------
Seq Scan on song (cost=0.00..205.00 rows=1 width=13) (actual
time=68.896..68.897 rows=0 loops=1)
Filter: (((artist)::text %> '12345678'::text) OR ((title)::text %>
'12345678'::text))
Rows Removed by Filter: 10000
Planning Time: 0.304 ms
Execution Time: 68.928 ms

EXPLAIN ANALYZE of the "fast" query
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on song (cost=288.00..292.02 rows=1 width=13)
(actual time=0.023..0.024 rows=0 loops=1)
Recheck Cond: (((artist)::text %> '12345678'::text) OR
((title)::text %> '12345678'::text))
-> BitmapOr (cost=288.00..288.00 rows=1 width=0) (actual
time=0.022..0.023 rows=0 loops=1)
-> Bitmap Index Scan on artist_trgm (cost=0.00..144.00
rows=1 width=0) (actual time=0.013..0.014 rows=0 loops=1)
Index Cond: ((artist)::text %> '12345678'::text)
-> Bitmap Index Scan on title_trgm (cost=0.00..144.00 rows=1
width=0) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: ((title)::text %> '12345678'::text)
Planning Time: 0.224 ms
Execution Time: 0.052 ms

The behaviour is identical when using similarity instead of word_similarity.
GIN indexes were chosen because the table is queried far more often than
it is updated.
I tried increasing shared_buffers, effective_cache_size or work_mem to
no avail.

Any help would be greatly appreciated.

Regards
Jonathan

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Ashkil Dighin 2021-12-01 14:26:11 LwLockRelease performance
Previous Message Robert Creager 2021-11-29 23:04:30 Re: Need help identifying a periodic performance issue.