Re: pg_trgm vs. Solr ngram

From: "Christian Ramseyer (mx04)" <rc(at)networkz(dot)ch>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: pg_trgm vs. Solr ngram
Date: 2023-02-12 00:13:52
Message-ID: caa259d8-9a9a-ecfe-10a0-cd2f65345d3e@networkz.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10.02.23 04:48, Laurenz Albe wrote:
> On Fri, 2023-02-10 at 03:20 +0100, Chris wrote:
>> In Solr I was using ngrams and customized the
>> TokenizerFactories until more or less only whitespace was as separator,
>> while [.:-_\d] remains part of the ngrams. This allows to search for
>> ".12.255/32" or "xzy-eth5.example.org" without any false positives.
>>
>> It looks like a straight conversion of this method is not possible

On 10.02.23 04:48, Laurenz Albe wrote:
> Here is a hack that you can try: pre-process your strings and replace
> symbols with rare characters:
>
> SELECT show_trgm(translate('127.0.0.1/32', './', 'qx'));
>
> show_trgm
> ═════════════════════════════════════════════════════════
> {" 1"," 12",0q0,0q1,127,1x3,27q,"32 ",7q0,q0q,q1x,x32}
> (1 row)
>
> Then you could search like
>
> WHERE translate(search_string, './', 'qx') LIKE translate('%127.0.0.1/32%', './', 'qx')
> AND search_string LIKE '%127.0.0.1/32%'
>
> The first condition can use a trigram index, and the second filters out
> false positives.

Hehe that is a nifty idea. I went to try this but then it turned out
that I was probably overthinking the whole issue already. Using a
gist_trgm_ops index and % as operator works perfectly well:

insert into docs (orig) values ('120.2.10.22');
insert into docs (orig) values ('120 2 10 22');
CREATE INDEX iorig ON docs USING GIST (orig gist_trgm_ops);

set enable_seqscan = off;
explain analyze verbose select * from docs where orig like '%.10.22%';

Index Scan using iorig on public.docs (cost=0.14..8.16 rows=1 width=32)
(actual time=0.952..1.018 rows=1 loops=1)
Output: orig
Index Cond: (docs.orig ~~ '%.10.22%'::text)
Rows Removed by Index Recheck: 1

Even though this query has the same trigrams like e.g. '% 10 22%', the
index recheck takes care of it and only the matching row is returned.
Excellent, not quite sure why I was expecting false positives in the
first place, it would be a pretty stark violation of how % is supposed
to behave.

Not quite sure how big the performance hit of not having the optimal
trigrams with punctuation in the index and rechecking some hits will be,
but for now I'll assume it's negligible. Otherwise I'll try the
translate variant.

Many thanks!

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Peter 2023-02-12 00:42:35 Queries running forever, because of wrong rowcount estimate
Previous Message Peter J. Holzer 2023-02-11 22:00:31 Re: WHERE col = ANY($1) extended to 2 or more columns?