| From: | PG Doc comments form <noreply(at)postgresql(dot)org> |
|---|---|
| To: | pgsql-docs(at)lists(dot)postgresql(dot)org |
| Cc: | postgresql(at)richardneill(dot)org |
| Subject: | How to query the underlying dictionary i.e. inverse of ts_lexize() |
| Date: | 2019-04-06 18:59:27 |
| Message-ID: | 155457716793.719.16452998626279741513@wrigleys.postgresql.org |
| Lists: | pgsql-docs |
The following documentation comment has been logged on the website:
Page: https://www.postgresql.org/docs/11/textsearch-debugging.html
Description:
It would be helpful if there were some documentation on how to query the
dictionaries themselves to get a canonical root word, either:
1. Directly, such as:
SELECT words FROM english_stem WHERE stem = 'chlorin';
-- should return e.g. "chlorine", "chlorination", "chlorinated"
-- there isn't any documentation on how to actually do this.
2. Indirectly, such as:
SELECT ts_unlexize('english_stem', 'chlorin');
-- this is a function which doesn't yet seem to exist: the one-to-many
-- inverse of ts_lexize().
3. Or, the canonical version of (2):
SELECT ts_canonical('english_stem', 'chlorin');
-- a one-to-one function to find the English root word (not the lexeme).
An example of where this is useful: consider a list of documents containing
a large amount of English text.
For this example, consider that the following words are frequent: "the",
"kitten", "kittens", "chlorination", "chlorinated", "temperature" and
"something".
We wish to display a "tag cloud" of the most common terms, excluding
stopwords, by means of ts_stat().
At the moment, it lists:
"kitten" -- correctly treating "kitten" and "kittens" as the same.
"chlorin" -- correctly merging "chlorination" and "chlorinated",
but creating a non-word.
"temperatur" -- right stem, but not a word.
"someth" -- over-stemmed: the stemmer has removed the -ing suffix.
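For reference, a minimal sketch of the kind of ts_stat() query that produces such a list. The table "documents" and its text column "body" are hypothetical names used only for illustration; ts_stat() and to_tsvector() are the standard functions.

```sql
-- Assumption: a table "documents" with a text column "body" (hypothetical names).
-- ts_stat() runs the quoted query, which must return a single tsvector column,
-- and reports one row per distinct lexeme: word, ndoc (number of documents
-- containing it) and nentry (total number of occurrences).
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('english', body) FROM documents$$)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```

Because to_tsvector() with the 'english' configuration already drops stop words and stems each token, the "word" column holds lexemes such as "chlorin" rather than dictionary words.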
So, given the array ["kitten","chlorin","temperatur","someth"], we wish to
un-stem to find the first valid English word whose stem is in that array,
i.e.
["kitten", "chlorine", "temperature", "something"]
Note that it is intentional to retrieve "chlorine" even though the original
inputs were "chlorinated" and "chlorination", and did not necessarily
contain "chlorine".
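One possible workaround, sketched here under assumptions rather than as a documented feature, is to precompute the stem of every word in a separate word list and query that mapping in reverse. The tables "word_list" and "stem_lookup" are hypothetical; only ts_lexize() and the english_stem dictionary are existing PostgreSQL objects.

```sql
-- Assumption: "word_list" is a hypothetical table of valid, lowercase English
-- words, e.g. loaded from a system word-list file.
CREATE TABLE word_list (word text PRIMARY KEY);

-- Precompute stem -> word pairs with ts_lexize(). It returns a text array;
-- taking element [1] yields NULL when the array is empty (e.g. stop words).
CREATE TABLE stem_lookup AS
SELECT (ts_lexize('english_stem', word))[1] AS stem, word
FROM word_list;

CREATE INDEX ON stem_lookup (stem);

-- One-to-many lookup: every known word that stems to 'chlorin'.
SELECT word FROM stem_lookup WHERE stem = 'chlorin';

-- One-to-one lookup: pick one representative per stem. Choosing the shortest
-- word is only a heuristic, not a true "canonical root".
SELECT DISTINCT ON (stem) stem, word
FROM stem_lookup
WHERE stem = ANY (ARRAY['kitten', 'chlorin', 'temperatur', 'someth'])
ORDER BY stem, length(word), word;
```

The result depends entirely on the contents of the word list, so the representative chosen for a stem may not be the word one would expect.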
There doesn't seem to be any process for doing this. Not sure whether this
is just something for the documentation, or an RFE for (2). Thanks very
much.