Re: How to drop all tokens that a snowball dictionary cannot stem?

From: Christoph Gößmann <mail(at)goessmann(dot)io>
To: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: How to drop all tokens that a snowball dictionary cannot stem?
Date: 2019-11-23 15:42:02
Message-ID: 24BF16F6-1397-44A8-885F-99B6009B04E2@goessmann.io
Lists: pgsql-general

Hi Jeff,

You're right about that point. Let me redefine the question: I would like to drop all tokens that are neither the stemmed nor the unstemmed form of a known word. Would it be possible to put a word list in front of the stemmer as a filter? Or do you know of a good English lexeme list that could be used to filter after stemming?
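
To make the second option (filtering after stemming) concrete, here is a rough sketch of what I have in mind. It assumes a word list has already been loaded into a table; the names known_words and known_lexemes are hypothetical, and the word list itself is exactly the piece I am still missing. It relies on unnest(tsvector) and ts_delete(tsvector, text[]), which as far as I know need PostgreSQL 9.6 or later.

```sql
-- Hypothetical table holding one known English word per row
-- (how to populate it is the open question).
CREATE TABLE known_words (word text PRIMARY KEY);

-- Stem every known word once with the same snowball dictionary,
-- so the result can be compared directly with tsvector lexemes.
-- Stop words yield an empty array and therefore no row.
CREATE TABLE known_lexemes AS
SELECT DISTINCT lex AS lexeme
FROM known_words,
     unnest(ts_lexize('english_stem', word)) AS lex;

-- Build the tsvector, then delete every lexeme that is not in the list.
SELECT ts_delete(
         v.vec,
         ARRAY(SELECT u.lexeme
               FROM unnest(v.vec) AS u
               WHERE NOT EXISTS (SELECT 1
                                 FROM known_lexemes k
                                 WHERE k.lexeme = u.lexeme))
       ) AS filtered
FROM (SELECT to_tsvector('english_led',
                         'A test sentence with ui44 \tt somejnk words') AS vec) AS v;
```

Because the list is stemmed with english_stem before the comparison, a token survives if either its stemmed or its unstemmed form is a known word, which is the behavior described above.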

Thanks,
Christoph

> On 23. Nov 2019, at 16:27, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <mail(at)goessmann(dot)io> wrote:
> Hi everybody,
>
> I am trying to get all the lexemes for a text using to_tsvector(). But I want only words that english_stem -- the integrated snowball dictionary -- is able to handle to show up in the final tsvector. Since snowball dictionaries only remove stop words, but keep the words that they cannot stem, I don't see an easy option to do this. Do you have any ideas?
>
> I went ahead with creating a new configuration:
>
> -- add new configuration english_led
> CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = pg_catalog.english);
>
> -- dropping any words that contain numbers already in the parser
> ALTER TEXT SEARCH CONFIGURATION english_led
> DROP MAPPING FOR numword;
>
> EXAMPLE:
>
> SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt somejnk words');
> to_tsvector
> --------------------------------------------------
> 'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
>
> In this tsvector, I would like 'somejnk' and 'tt' not to be included.
>
> I don't think the question is well defined. It will happily stem 'somejnking' to 'somejnk'; doesn't that mean that it **can** handle it? The fact that 'somejnk' itself wasn't altered during stemming doesn't mean it wasn't handled, just as 'test' wasn't altered during stemming.
>
> Cheers,
>
> Jeff
