pg_trgm vs. Solr ngram

From: Chris <rc@networkz.ch>
To: pgsql-general@lists.postgresql.org
Subject: pg_trgm vs. Solr ngram
Date: 2023-02-10 02:20:36
Message-ID: 4628c3f6-e2c5-1484-71cf-62446cec984d@networkz.ch
Lists: pgsql-general

Hello list

I'm pondering migrating an FTS application from Solr to Postgres, just
because we use Postgres for everything else.

The application is basically fgrep with a web frontend. However, the
indexed documents are very computer-network specific and contain a lot
of hyphenated hostnames with dot-separated domains, as well as IPv4 and
IPv6 addresses. In Solr I was using ngrams and customized the
TokenizerFactories until more or less only whitespace acted as a
separator, while [.:-_\d] remained part of the ngrams. This allows
searching for ".12.255/32" or "xzy-eth5.example.org" without any false
positives.

It looks like a straight conversion of this method is not possible,
since the tokenization in pg_trgm is not configurable afaict. Is there
some other good method to search for an arbitrary substring, punctuation
included, using an index? Or a pg_trgm-style module that is more
flexible, like the Solr/Lucene variant?
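
To illustrate what I mean with stock pg_trgm (a minimal sketch; the
table and column names are made up):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- Default tokenization: non-alphanumeric characters act as word
    -- separators, so the trigrams here come from the words 'xzy',
    -- 'eth5', 'example' and 'org' -- the '-' and '.' are dropped.
    SELECT show_trgm('xzy-eth5.example.org');

    -- A trigram GIN index still returns exact results for LIKE
    -- patterns containing punctuation, because each candidate row is
    -- rechecked against the full pattern; the punctuation just cannot
    -- help narrow the index scan itself.
    CREATE TABLE docs (body text);
    CREATE INDEX docs_body_trgm ON docs USING gin (body gin_trgm_ops);

    SELECT * FROM docs WHERE body LIKE '%.12.255/32%';

So if I read this right, the final result set stays exact; my worry is
more that the index can only be searched on the alphanumeric pieces.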

Or maybe hacking my own pg_trgm wouldn't be so hard and could even be
fun. Would I pretty much just need to change the emitted tokens, or
would that lead to significant complications in the operators, indexes,
etc.?
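
From a quick look at the sources, the tokenization appears to be
governed by compile-time switches in contrib/pg_trgm/trgm.h rather than
anything configurable at run time, so a first experiment might be as
simple as this (excerpt from memory -- please verify against your
PostgreSQL version):

    /*
     * contrib/pg_trgm/trgm.h
     *
     * KEEPONLYALNUM makes pg_trgm treat any non-alphanumeric character
     * as a word separator.  Commenting it out should keep characters
     * like '.', ':', '-' and '_' inside the extracted trigrams.
     */
    #define KEEPONLYALNUM
    #define IGNORECASE      /* fold case before extracting trigrams */

Presumably any existing trigram indexes would have to be rebuilt after
recompiling with a change like that, since both the indexed trigrams
and the trigrams extracted from query patterns would change.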

thanks for any hints & cheers
Christian
