Re: Multilang text search. Is this correct?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Multilang text search. Is this correct?
Date: 2023-03-30 18:28:58
Message-ID: 257315.1680200938@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Florents Tselai <florents(dot)tselai(at)gmail(dot)com> writes:
> In my table (mix of text and jsonb columns ) I have text in multiple languages.
> I’d like search in all the possible regconfigs, so I’ve come up with the following recipe:

> CREATE FUNCTION to_tsvector_multilang(text) RETURNS tsvector AS
> $$
> SELECT to_tsvector('english', $1) ||
> to_tsvector('greek', $1) ||
> to_tsvector('simple', $1)
> $$ LANGUAGE sql IMMUTABLE;

> Looks like it’s working as expected from my prelim tests, but anyone seeing any potential pitfalls?

Well, it'll work for some value of "work", but your tsvectors and
associated indexes will be bloated, and you're likely to get false
matches. (For example, common words like "the" will get indexed
and treated as matchable unless they're stopwords in all three
configurations.)

The usual compromise is to use just one config --- either 'simple',
or a specific language if you have a preponderance of data in that
language --- and tolerate the fact that searches won't be too bright
about things like reducing stemmed words.

regards, tom lane

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Andrey Klochkov 2023-03-30 20:05:54 Re: Do BRIN indexes support MIN/MAX?
Previous Message Peter J. Holzer 2023-03-30 16:21:53 Re: Using CTID system column as a "temporary" primary key