From: | Emre Hasegeli <emre(at)hasegeli(dot)com> |
---|---|
To: | Aleksandr Parfenov <a(dot)parfenov(at)postgrespro(dot)ru> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> |
Subject: | Re: Flexible configuration for full-text search |
Date: | 2017-10-26 18:01:14 |
Message-ID: | CAE2gYzwAeuNB=e1tvM826CxFrons5beEYqTshdo2HOMTQb9XKg@mail.gmail.com |
Lists: | pgsql-hackers |
> The patch introduces way to configure FTS based on CASE/WHEN/THEN/ELSE
> construction.
Interesting feature. I needed this flexibility before when I was
implementing text search for a Turkish private listing application.
Aleksandr and Artur were kind enough to discuss it with me off-list
today.
> 1) Multilingual search. Can be used for FTS on a set of documents in
> different languages (example for German and English languages).
>
> ALTER TEXT SEARCH CONFIGURATION multi
> ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part WITH CASE
> WHEN english_hunspell AND german_hunspell THEN
> english_hunspell UNION german_hunspell
> WHEN english_hunspell THEN english_hunspell
> WHEN german_hunspell THEN german_hunspell
> ELSE german_stem UNION english_stem
> END;
I understand the need to support branching, but this syntax is overly
complicated. I don't think there is any need to support different sets
of dictionaries for the condition and the action. Something like this
might work better:
    ALTER TEXT SEARCH CONFIGURATION multi
        ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                          word, hword, hword_part WITH
            CASE english_hunspell UNION german_hunspell
                WHEN MATCH THEN KEEP
                ELSE german_stem UNION english_stem
            END;
To put it formally:
    ALTER TEXT SEARCH CONFIGURATION name
        ADD MAPPING FOR token_type [, ... ] WITH config
where config is one of:
    dictionary_name
    config { UNION | INTERSECT | EXCEPT } config
    CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
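To make the intended semantics of this grammar concrete, here is a minimal Python sketch of an evaluator for it. Everything in it is an assumption made for illustration: the class names, the convention that a dictionary returns None for "no match", the way the set operators treat a non-matching side, and the toy dictionaries. It is not how the patch or PostgreSQL implements anything.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

# Assumption: None means "no match"; a (possibly empty) set means "matched".
Lexemes = Optional[FrozenSet[str]]


@dataclass(frozen=True)
class Dictionary:
    name: str
    lookup: Callable[[str], Lexemes]


@dataclass(frozen=True)
class SetOp:
    op: str          # "UNION", "INTERSECT" or "EXCEPT"
    left: object
    right: object


@dataclass(frozen=True)
class Case:
    cond: object             # CASE cond
    config: object           # the config after THEN [ KEEP ELSE ]
    no_match: bool = False   # WHEN NO MATCH instead of WHEN MATCH
    keep: bool = False       # THEN KEEP ELSE config instead of THEN config


def evaluate(cfg, token: str) -> Lexemes:
    if isinstance(cfg, Dictionary):
        return cfg.lookup(token)
    if isinstance(cfg, SetOp):
        left, right = evaluate(cfg.left, token), evaluate(cfg.right, token)
        if cfg.op == "UNION":
            if left is None and right is None:
                return None
            return (left or frozenset()) | (right or frozenset())
        if cfg.op == "INTERSECT":
            return None if left is None or right is None else left & right
        if cfg.op == "EXCEPT":
            return None if left is None else left - (right or frozenset())
        raise ValueError(f"unknown operator: {cfg.op}")
    if isinstance(cfg, Case):
        out = evaluate(cfg.cond, token)
        hit = (out is not None) != cfg.no_match
        if cfg.keep:
            # THEN KEEP reuses the output of the condition itself.
            return out if hit else evaluate(cfg.config, token)
        return evaluate(cfg.config, token) if hit else None
    raise TypeError(f"not a config: {cfg!r}")


def table(name, entries):
    """Toy dictionary backed by a dict; unknown tokens do not match."""
    return Dictionary(
        name, lambda tok: frozenset(entries[tok]) if tok in entries else None
    )


# Toy stand-ins for the dictionaries named in the example above.
english_hunspell = table("english_hunspell", {"books": ["book"]})
german_hunspell = table("german_hunspell", {"buecher": ["buch"]})
english_stem = Dictionary("english_stem", lambda t: frozenset({t.rstrip("s")}))
german_stem = Dictionary("german_stem", lambda t: frozenset({t}))

# CASE english_hunspell UNION german_hunspell
#   WHEN MATCH THEN KEEP ELSE german_stem UNION english_stem END
multi = Case(
    cond=SetOp("UNION", english_hunspell, german_hunspell),
    config=SetOp("UNION", german_stem, english_stem),
    keep=True,
)

print(sorted(evaluate(multi, "books")))  # ['book']
print(sorted(evaluate(multi, "cats")))   # ['cat', 'cats']
```

The point of the sketch is that a single recursive config type covers all three productions, so the condition and the action never need separate dictionary lists.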
> 2) Combination of exact search with morphological one. This patch not
> fully solve the problem but it is a step toward solution. Currently, we
> should split exact and morphological search in query manually and use
> separate index for each part. With new way to configure FTS we can use
> following configuration:
>
> ALTER TEXT SEARCH CONFIGURATION exact_and_morph
> ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part WITH CASE
> WHEN english_hunspell THEN english_hunspell UNION simple
> ELSE english_stem UNION simple
> END
This could be:
    CASE english_hunspell
        WHEN MATCH THEN KEEP
        ELSE english_stem
    END
    UNION
    simple
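As a quick sanity check that factoring "UNION simple" out of both branches does not change the result, the two forms can be compared on toy data. This is only a sketch: the three functions are hypothetical stand-ins for the real dictionaries, and None is assumed to mean "no match".

```python
def english_hunspell(tok):
    # Toy lookup table; returns None when the word is unknown.
    return {"books": {"book"}}.get(tok)


def english_stem(tok):
    # Toy stemmer: strips one trailing "s"; always produces a lexeme.
    return {tok[:-1]} if tok.endswith("s") else {tok}


def simple(tok):
    # Toy version of "simple": lower-cases the token.
    return {tok.lower()}


def original(tok):
    # CASE WHEN english_hunspell THEN english_hunspell UNION simple
    #      ELSE english_stem UNION simple END
    hit = english_hunspell(tok)
    if hit is not None:
        return hit | simple(tok)
    return english_stem(tok) | simple(tok)


def proposed(tok):
    # CASE english_hunspell WHEN MATCH THEN KEEP ELSE english_stem END
    # UNION simple
    hit = english_hunspell(tok)
    kept = hit if hit is not None else english_stem(tok)
    return kept | simple(tok)


for tok in ["books", "cats", "Running"]:
    assert original(tok) == proposed(tok)
```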
> 3) Using different dictionaries for recognizing and output generation.
> As I mentioned before, in new syntax condition and command are separate
> and we can use it for some more complex text processing. Here an
> example for processing only nouns:
>
> ALTER TEXT SEARCH CONFIGURATION nouns_only
> ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part WITH CASE
> WHEN english_noun THEN english_hunspell
> END
This would also still work with the simpler syntax because
"english_noun", still being a dictionary, would pass the tokens to the
next one.
> 4) Special stopword processing allows us to discard stopwords even if
> the main dictionary doesn't support such feature (in example pl_ispell
> dictionary keeps stopwords in text):
>
> ALTER TEXT SEARCH CONFIGURATION pl_without_stops
> ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part WITH CASE
> WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
> END
Instead of supporting the old way of attaching stopwords to
dictionaries, we can make them dictionaries in their own right. This
would then become something like:
    CASE polish_stopword
        WHEN NO MATCH THEN polish_ispell
    END
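A rough sketch of how a standalone stopword dictionary would behave under this rule, again in Python. The word lists are made up for illustration, and the conventions (None for "no match", an empty list for "recognized stopword") are assumptions of the sketch rather than a description of the patch.

```python
# Hypothetical, tiny stopword list; a real one would come from a file.
POLISH_STOPWORDS = {"i", "w", "na"}


def polish_stopword(token):
    # Matches only stopwords, and emits no lexemes for them.
    return [] if token in POLISH_STOPWORDS else None


def polish_ispell(token):
    # Toy stand-in for the morphological dictionary.
    return {"domy": ["dom"]}.get(token)


def pl_without_stops(token):
    # CASE polish_stopword WHEN NO MATCH THEN polish_ispell END
    if polish_stopword(token) is None:
        return polish_ispell(token)
    return None  # stopword: nothing is emitted for this token


print(pl_without_stops("domy"))  # ['dom']
print(pl_without_stops("na"))    # None
```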