From: | Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> |
---|---|
To: | Chris <rc(at)networkz(dot)ch>, pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: pg_trgm vs. Solr ngram |
Date: | 2023-02-10 03:48:48 |
Message-ID: | c9d6661b3936316e34054b95a0dbf507914ad051.camel@cybertec.at |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Fri, 2023-02-10 at 03:20 +0100, Chris wrote:
> I'm pondering migrating an FTS application from Solr to Postgres, just
> because we use Postgres for everything else.
>
> The application is basically fgrep with a web frontend. However the
> indexed documents are very computer network specific and contain a lot
> of hyphenated hostnames with dot-separated domains, as well as IPv4 and
> IPv6 addresses. In Solr I was using ngrams and customized the
> TokenizerFactories until more or less only whitespace was as separator,
> while [.:-_\d] remains part of the ngrams. This allows to search for
> ".12.255/32" or "xzy-eth5.example.org" without any false positives.
>
> It looks like a straight conversion of this method is not possible since
> the tokenization in pg_trgm is not configurable afaict. Is there some
> other good method to search for a random substring including all the
> punctuation using an index? Or a pg_trgm-style module that is more
> flexible like the Solr/Lucene variant?'127.0.0.1/32'
>
> Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun, do
> I pretty much just need to change the emitted tokens or will this lead
> to significant complications in the operators, indexes etc.?
Here is a hack that you can try: pre-process your strings and replace
symbols with rare characters:
SELECT show_trgm(translate('127.0.0.1/32', './', 'qx'));
show_trgm
═════════════════════════════════════════════════════════
{" 1"," 12",0q0,0q1,127,1x3,27q,"32 ",7q0,q0q,q1x,x32}
(1 row)
Then you could search like
WHERE translate(search_string, './', 'qx') LIKE translate('%127.0.0.1/32%', './', 'qx')
AND search_string LIKE '%127.0.0.1/32%'
The first condition can use a trigram index, and the second filters out
false positives.
Yours,
Laurenz Albe
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2023-02-10 05:13:46 | Re: pg_trgm vs. Solr ngram |
Previous Message | Chris | 2023-02-10 02:20:36 | pg_trgm vs. Solr ngram |