Re: Hunspell as filtering dictionary

From: Hugh Ranalli <hugh(at)whtc(dot)ca>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Hunspell as filtering dictionary
Date: 2019-11-06 15:49:48
Message-ID: CAAhbUMPEwNgvcVJRdta5RR3TVxNf6MGjhGms5RFS63gwYPVXhA@mail.gmail.com
Lists: pgsql-general

On Tue, 5 Nov 2019 at 09:42, Bibi Mansione <golgote(at)gmail(dot)com> wrote:

> Hi,
> I am trying to create a ts_vector from a French text. Here are the
> operations that seem logical to perform in that order:
>
> 1. remove stopwords
> 2. use hunspell to find words roots
> 3. unaccent
>

I can't speak to French, but we use a similar configuration for English,
with unaccent first, then hunspell. We found that there were words that
hunspell didn't recognise and instead pulled apart (for example,
"contract" became "con" and "tract"), so I wonder if something similar is
happening with "découvrir". To solve this, we put a custom dictionary
containing these terms in front of hunspell. Unaccent definitely has to be
called first, since it is a filtering dictionary that passes its output on
to the next dictionary in the list. We also gave hunspell a custom
stopwords file, to eliminate certain other terms, such as profanities:

-- We use a custom stopwords file, to filter out other terms, such as
-- profanities
ALTER TEXT SEARCH DICTIONARY
hunspell_en_ca (
Stopwords = our_custom_stopwords
);

-- Adding english_stem allows us to recognize words which hunspell
-- doesn't, particularly acronyms such as CGA
ALTER TEXT SEARCH CONFIGURATION
our_configuration
ALTER MAPPING FOR
asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH
unaccent, our_custom_dictionary, hunspell_en_ca, english_stem
;

There was definitely a fair bit of trial and error to determine the correct
order and configuration.
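
For anyone trying to reproduce this, here is a rough sketch of how a custom dictionary like ours could be defined and how the ordering can be checked. It is an illustration only: it assumes the synonym template and a hypothetical file our_custom_terms.syn in $SHAREDIR/tsearch_data that maps each protected word to itself (for example, a line reading "contract contract"). Other templates would work too, and the dictionary has to exist before the ALTER MAPPING above is run.

```sql
-- Assumption: a synonym-template dictionary; the file name
-- our_custom_terms (our_custom_terms.syn on disk) is hypothetical.
CREATE TEXT SEARCH DICTIONARY our_custom_dictionary (
    TEMPLATE = synonym,
    SYNONYMS = our_custom_terms
);

-- Show which dictionary recognised each token and the lexemes produced
SELECT alias, token, dictionary, lexemes
FROM ts_debug('our_configuration', 'contract');

-- Test a single dictionary in isolation to see how it treats a word
SELECT ts_lexize('hunspell_en_ca', 'contract');
```

Running ts_debug after each change to the mapping is a quick way to see which dictionary in the chain is claiming a word.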
