From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Christoph Gößmann <mail(at)goessmann(dot)io> |
Cc: | pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: How to drop all tokens that a snowball dictionary cannot stem? |
Date: | 2019-11-23 15:27:29 |
Message-ID: | CAMkU=1zS-+M4yeN_msxdd9u=PzS+Ne=SkKPxNrnVmvaw-Knr_w@mail.gmail.com |
Lists: | pgsql-general |
On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <mail(at)goessmann(dot)io> wrote:
> Hi everybody,
>
> I am trying to get all the lexemes for a text using to_tsvector(). But I
> want only words that english_stem -- the integrated snowball dictionary --
> is able to handle to show up in the final tsvector. Since snowball
> dictionaries only remove stop words, but keep the words that they cannot
> stem, I don't see an easy option to do this. Do you have any ideas?
>
> I went ahead with creating a new configuration:
>
> -- add new configuration english_led
> CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY =
> pg_catalog.english);
>
> -- dropping any words that contain numbers already in the parser
> ALTER TEXT SEARCH CONFIGURATION english_led
> DROP MAPPING FOR numword;
>
> EXAMPLE:
>
> SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt
> somejnk words');
> to_tsvector
> --------------------------------------------------
> 'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
>
> In this tsvector, I would like 'somejnk' and 'tt' not to be included.
>
I don't think the question is well defined. The stemmer will happily stem
'somejnking' to 'somejnk'; doesn't that mean that it **can** handle it?
The fact that 'somejnk' itself wasn't altered during stemming doesn't mean
it wasn't handled, just as 'test' wasn't altered during stemming.
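
A quick way to see this for yourself is ts_lexize(), which runs a single
dictionary against a single token. This is only a sketch, assuming a stock
PostgreSQL install where the built-in english_stem dictionary is available;
the sample tokens are the ones from this thread.

```sql
-- Assumes the built-in english_stem snowball dictionary is installed.

-- A made-up word with a suffix: the stemmer strips the suffix anyway.
SELECT ts_lexize('english_stem', 'somejnking');  -- {somejnk}

-- Tokens already in stem form come back unchanged, whether or not
-- they are real English words.
SELECT ts_lexize('english_stem', 'somejnk');     -- {somejnk}
SELECT ts_lexize('english_stem', 'test');        -- {test}

-- A stop word: the dictionary recognizes it and returns an empty array,
-- which is why it is dropped from the tsvector.
SELECT ts_lexize('english_stem', 'a');           -- {}
```

Nothing in the result tells a "known" word apart from an unknown one. The
stemmer applies suffix rules rather than looking words up in a word list.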
Cheers,
Jeff