From: | Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru> |
---|---|
To: | Emre Hasegeli <emre(at)hasegeli(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> |
Subject: | Re: Flexible configuration for full-text search |
Date: | 2017-10-30 12:40:32 |
Message-ID: | 20171030154032.5447672c@asp437-24-g082ur |
Lists: | pgsql-hackers |
I'm mostly happy with the mentioned modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or two.
On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre(at)hasegeli(dot)com> wrote:
> To put it formally:
>
> ALTER TEXT SEARCH CONFIGURATION name
> ADD MAPPING FOR token_type [, ... ] WITH config
>
> where config is one of:
>
> dictionary_name
> config { UNION | INTERSECT | EXCEPT } config
> CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
According to the formal definition, the following configurations are valid:
CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END
But the configuration:
CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
is not (as I understand it, ELSE can be used only with KEEP).
I think we should decide whether to allow or disallow using different
dictionaries for match checking (between CASE and WHEN) and for the result
(after THEN). If the answer is 'allow', maybe we should allow the
third example too, for consistency in configurations.
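To make the point concrete, here is a minimal Python sketch of a checker for the grammar quoted above. It is only an illustration, not part of the patch: the Parser class and is_valid function are hypothetical names, and it assumes that the keywords are reserved, uppercase, and whitespace-separated.

```python
# Hypothetical checker for the quoted grammar (illustration only, not patch code).
# Assumption: keywords are reserved, uppercase, and separated by whitespace.
KEYWORDS = {"CASE", "WHEN", "NO", "MATCH", "THEN", "KEEP", "ELSE", "END",
            "UNION", "INTERSECT", "EXCEPT"}
SET_OPS = {"UNION", "INTERSECT", "EXCEPT"}


class Parser:
    def __init__(self, text):
        self.tokens = text.split()
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, word):
        if self.peek() != word:
            raise SyntaxError(f"expected {word}, got {self.peek()}")
        self.pos += 1

    def config(self):
        # config { UNION | INTERSECT | EXCEPT } config
        self.primary()
        while self.peek() in SET_OPS:
            self.pos += 1
            self.primary()

    def primary(self):
        tok = self.peek()
        if tok == "CASE":
            # CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
            self.pos += 1
            self.config()
            self.expect("WHEN")
            if self.peek() == "NO":
                self.pos += 1
            self.expect("MATCH")
            self.expect("THEN")
            if self.peek() == "KEEP":
                self.pos += 1
                self.expect("ELSE")  # ELSE is only reachable after KEEP
            self.config()
            self.expect("END")
        elif tok is None or tok in KEYWORDS:
            raise SyntaxError(f"expected dictionary name, got {tok}")
        else:
            self.pos += 1  # plain dictionary_name


def is_valid(text):
    parser = Parser(text)
    try:
        parser.config()
    except SyntaxError:
        return False
    return parser.peek() is None  # no trailing tokens allowed


print(is_valid("CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END"))
print(is_valid("CASE english_noun WHEN MATCH THEN english_hunspell END"))
print(is_valid("CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END"))
```

Under these assumptions the first two configurations are accepted and the third is rejected, because the parser expects END right after english_hunspell.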
> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> > ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> > word, hword, hword_part WITH CASE
> > WHEN english_noun THEN english_hunspell
> > END
>
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.
Based on the formal definition, this example can be written as
follows:
CASE english_noun WHEN MATCH THEN english_hunspell END
The question is the same as in the previous example.
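A rough Python sketch of how I read this configuration: one dictionary decides whether the token matches, and a different one produces the output. Everything here is a toy assumption for illustration (the case_when_match, toy_noun_dict and toy_stemmer names, the noun list, and the choice to return None when the condition does not match); it is not the behavior of the patch.

```python
# Toy model of: CASE english_noun WHEN MATCH THEN english_hunspell END
# Assumption: a dictionary returns None when it does not recognize a token.
TOY_NOUNS = {"cats", "message"}


def toy_noun_dict(token):
    """Stands in for english_noun: recognizes only a fixed set of nouns."""
    return [token] if token in TOY_NOUNS else None


def toy_stemmer(token):
    """Stands in for english_hunspell: strips trailing 's' characters."""
    return [token.rstrip("s")]


def case_when_match(condition, output, token):
    """Use `condition` only for the match check and `output` for the result."""
    if condition(token) is None:
        return None  # no ELSE branch, so nothing is produced
    return output(token)


print(case_when_match(toy_noun_dict, toy_stemmer, "cats"))  # ['cat']
print(case_when_match(toy_noun_dict, toy_stemmer, "runs"))  # None
```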
> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own. This would then become
> something like:
>
> CASE polish_stopword
> WHEN NO MATCH THEN polish_isspell
> END
Currently, stopwords increment the position counter, for example:
SELECT to_tsvector('english','a test message');
---------------------
'messag':3 'test':2
The stopword 'a' has position 1, but it is not in the vector.
If we want to preserve this behavior, we should somehow pass the stopword
to the tsvector composition function (parsetext in ts_parse.c) so that it
increments the counter, or increment it in some other way. Currently, an
empty lexemes array is passed as the result of LexizeExec.
One possible way to do so is something like:
CASE polish_stopword
WHEN MATCH THEN KEEP -- stopword counting
ELSE polish_isspell
END
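The position counting discussed above can be modeled with a short Python sketch. It is a toy, not the logic of parsetext: toy_lexize, build_vector, the stopword set and the stem table are hypothetical, and the only thing taken from the current behavior is that a stopword yields an empty lexeme list while still consuming a position.

```python
# Toy model of position counting (illustration only, not ts_parse.c code).
STOPWORDS = {"a", "the"}
TOY_STEMS = {"message": "messag"}  # stand-in for a real stemmer


def toy_lexize(token):
    """Return [] for a stopword, otherwise a one-element lexeme list."""
    if token in STOPWORDS:
        return []
    return [TOY_STEMS.get(token, token)]


def build_vector(text, count_stopwords=True):
    """Map each lexeme to its positions, optionally counting stopwords."""
    vector = {}
    pos = 0
    for token in text.split():
        lexemes = toy_lexize(token)
        if lexemes or count_stopwords:
            pos += 1  # the stopword still takes a position when counted
        for lexeme in lexemes:
            vector.setdefault(lexeme, []).append(pos)
    return vector


print(build_vector("a test message"))
# {'test': [2], 'messag': [3]}
print(build_vector("a test message", count_stopwords=False))
# {'test': [1], 'messag': [2]}
```

The second call shows what would change if a separate stopword dictionary dropped the token before the counter saw it: every later position shifts down by one.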
--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company