Re: How to drop all tokens that a snowball dictionary cannot stem?

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Christoph Gößmann <mail(at)goessmann(dot)io>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: How to drop all tokens that a snowball dictionary cannot stem?
Date: 2019-11-23 15:27:29
Message-ID: CAMkU=1zS-+M4yeN_msxdd9u=PzS+Ne=SkKPxNrnVmvaw-Knr_w@mail.gmail.com
Lists: pgsql-general

On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <mail(at)goessmann(dot)io> wrote:

> Hi everybody,
>
> I am trying to get all the lexemes for a text using to_tsvector(). But I
> want only words that english_stem -- the integrated snowball dictionary --
> is able to handle to show up in the final tsvector. Since snowball
> dictionaries only remove stop words, but keep the words that they cannot
> stem, I don't see an easy option to do this. Do you have any ideas?
>
> I went ahead with creating a new configuration:
>
> -- add new configuration english_led
> CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY =
> pg_catalog.english);
>
> -- dropping any words that contain numbers already in the parser
> ALTER TEXT SEARCH CONFIGURATION english_led
> DROP MAPPING FOR numword;
>
> EXAMPLE:
>
> SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt
> somejnk words');
> to_tsvector
> --------------------------------------------------
> 'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
>
> In this tsvector, I would like 'somejnk' and 'tt' not to be included.
>

I don't think the question is well defined. The stemmer will happily stem
'somejnking' to 'somejnk'; doesn't that mean that it **can** handle it?
The fact that 'somejnk' itself wasn't altered during stemming doesn't mean
it wasn't handled, just as 'test' wasn't altered during stemming.
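
One way to see this for yourself is to feed single tokens straight to the
dictionary with ts_lexize(), bypassing the parser and the configuration.
This is only a sketch: it assumes the stock english_stem dictionary, and the
comments describe what I would expect based on the output you posted rather
than output I have pasted from a session.

```sql
-- ts_lexize(dictionary, token) runs one token through one dictionary
-- and returns the resulting array of lexemes.

-- Made-up word with an English suffix: expected to lose the suffix.
SELECT ts_lexize('english_stem', 'somejnking');

-- The same made-up word without a suffix: expected to come back as is.
SELECT ts_lexize('english_stem', 'somejnk');

-- A real word that the stemmer also leaves unchanged.
SELECT ts_lexize('english_stem', 'test');

-- A stop word: the dictionary recognizes it and returns an empty array.
SELECT ts_lexize('english_stem', 'with');
```

As far as I know, a snowball dictionary never returns NULL (the "I don't
know this token" answer), so its result alone cannot separate real words
from junk.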

Cheers,

Jeff
