Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters

From: pgsql-performance(at)jhacker(dot)de
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters
Date: 2021-12-08 18:43:03
Message-ID: 144fe629-99b1-c773-7d37-5c1fca5d911a@mailbox.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thank you both a lot for the insights and your input.

> Yeah, this test case seems very unrealistic, both as to table size
> and as to the lack of variability of the table entries.

The example was based on real data with a more complicated query which
prompted me to investigate the issue. The distinction between slow and
fast queries is not as clear cut as with the generated data, but the
general problem remains.

>> Since you have SSDs, you should tune "random_page_cost = 1.1".

I tested different values of random_page_cost with various queries. Too
small values increased the execution time again, due to too eager index
usage. I identified the optimum for my use case at 1.4. This solved my
problem, thanks.

Regards
Jonathan

On 07.12.21 18:08, Tom Lane wrote:
> Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> writes:
>> On Tue, 2021-11-30 at 22:38 +0100, pgsql-performance(at)jhacker(dot)de wrote:
>>> INSERT INTO song (artist, title)
>>> SELECT 'artist','title'
>>> FROM generate_series(1,10000);
>>>
>>> \set query '12345678'
>>>
>>> -- This query is slow
>>> EXPLAIN ANALYZE
>>> SELECT song.artist, song.title
>>> FROM song
>>> WHERE (song.artist %> :'query' OR song.title %> :'query')
>>> ;
>
>> The table is quite small; with a bigger table, the test would be more meaningful.
>
> Yeah, this test case seems very unrealistic, both as to table size
> and as to the lack of variability of the table entries. I think the
> latter is causing the indexscans to take less time than they otherwise
> might, because none of the extracted trigrams find any matches.
>
>> Since you have SSDs, you should tune "random_page_cost = 1.1".
>
> Right. Poking at gincostestimate a bit, I see that for this
> operator the indexscan cost estimate is basically driven by the
> number of trigrams extracted from the query string (nine in this
> test case) and the index size; those lead to a predicted number
> of index page fetches that's then scaled by random_page_cost.
> That's coming out to make it look more expensive than the seqscan.
> It's actually not more expensive, but that's partially because
> page fetch costs are really zero in this test case (everything
> will stay in shared buffers the whole time), and partially because
> the unrealistic data pattern is leading to not having to look at
> as much of the index as gincostestimate expected.
>
> In general, it appears correct that longer query strings lead to a
> higher index cost estimate, because they produce more trigrams so
> there's more work for the index match to do. (At some level, a
> longer query means more work in the seqscan case too; but our cost
> models are inadequate to predict that.)
>
> regards, tom lane
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Imre Samu 2021-12-10 18:21:26 Re: PostgreSQLv14 TPC-H performance GCC vs Clang
Previous Message Tom Lane 2021-12-07 17:08:05 Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters