Re: pg_trgm vs. Solr ngram

From: Bertrand Mamasam <golgote(at)gmail(dot)com>
To: Chris <rc(at)networkz(dot)ch>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_trgm vs. Solr ngram
Date: 2023-02-10 07:54:03
Message-ID: CACZ67_UD38B7r3ug5uSW2_TgtdGt+hPhhw0BnsM+bK1vQE_01g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Le ven. 10 févr. 2023, 03:20, Chris <rc(at)networkz(dot)ch> a écrit :

> Hello list
>
> I'm pondering migrating an FTS application from Solr to Postgres, just
> because we use Postgres for everything else.
>
> The application is basically fgrep with a web frontend. However the
> indexed documents are very computer network specific and contain a lot
> of hyphenated hostnames with dot-separated domains, as well as IPv4 and
> IPv6 addresses. In Solr I was using ngrams and customized the
> TokenizerFactories until more or less only whitespace was as separator,
> while [.:-_\d] remains part of the ngrams. This allows to search for
> ".12.255/32" or "xzy-eth5.example.org" without any false positives.
>
> It looks like a straight conversion of this method is not possible since
> the tokenization in pg_trgm is not configurable afaict. Is there some
> other good method to search for a random substring including all the
> punctuation using an index? Or a pg_trgm-style module that is more
> flexible like the Solr/Lucene variant?
>
> Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun, do
> I pretty much just need to change the emitted tokens or will this lead
> to significant complications in the operators, indexes etc.?
>
> thanks for any hints & cheers
> Christian
>

In Solr you used FTS so I suggest that you do the same in Postgres and look
at the full text search functions. You can create a tsvector yourself in
many different ways or use one of the provided functions. So you could add
complete IP adresses to your index and then search for them using something
like phrase search. You can also create text search configurations or just
use the "simple" one if you just need something like fgrep. Of course, the
end result will be more like Solr and less like fgrep.

https://www.postgresql.org/docs/current/textsearch.html

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Alban Hertroys 2023-02-10 08:14:19 Re: WHERE col = ANY($1) extended to 2 or more columns?
Previous Message Tom Lane 2023-02-10 05:13:46 Re: pg_trgm vs. Solr ngram