From: Christoph Gößmann <mail(at)goessmann(dot)io>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: How to drop all tokens that a snowball dictionary cannot stem?
Date: 2019-11-22 13:01:18
Message-ID: 50A531BE-8A5D-40BA-B6AF-4B9B32FB7FF3@goessmann.io
Lists: pgsql-general
Hi everybody,
I am trying to get all the lexemes for a text using to_tsvector(), but I want only words that english_stem -- the built-in snowball dictionary -- can actually stem to show up in the final tsvector. Since a snowball dictionary only removes stop words and keeps any word it cannot stem, I don't see an easy way to do this. Do you have any ideas?
I went ahead and created a new configuration:
-- add new configuration english_led
CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = pg_catalog.english);
-- drop tokens containing digits already at the parser stage
ALTER TEXT SEARCH CONFIGURATION english_led
DROP MAPPING FOR numword;
EXAMPLE:
SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt somejnk words');
to_tsvector
--------------------------------------------------
'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
In this tsvector, I would like 'somejnk' and 'tt' not to be included.
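One direction I considered (a sketch only, untested): a snowball stemmer accepts every token, but an ispell-style dictionary works from a real word list, and PostgreSQL discards any token that no dictionary in the mapping recognizes. The dictionary name english_ispell and the en_us file names below are placeholders; the corresponding .affix and .dict files would have to be installed under $SHAREDIR/tsearch_data. Note that the resulting lexemes would be ispell normalizations rather than snowball stems.

-- Hypothetical: assumes en_us.affix and en_us.dict are installed
-- in $SHAREDIR/tsearch_data (file names are placeholders).
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    StopWords = english
);

-- With english_ispell as the only dictionary in the mapping,
-- unrecognized tokens such as 'somejnk' and 'tt' are discarded
-- instead of being kept verbatim.
ALTER TEXT SEARCH CONFIGURATION english_led
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_ispell;

But I am not sure this is the cleanest way, since it trades the snowball stems for the ispell output.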
Many thanks,
Christoph