How to query the underlying dictionary i.e. inverse of ts_lexize()

From: PG Doc comments form <noreply(at)postgresql(dot)org>
To: pgsql-docs(at)lists(dot)postgresql(dot)org
Cc: postgresql(at)richardneill(dot)org
Subject: How to query the underlying dictionary i.e. inverse of ts_lexize()
Date: 2019-04-06 18:59:27
Message-ID: 155457716793.719.16452998626279741513@wrigleys.postgresql.org
Lists: pgsql-docs

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/textsearch-debugging.html
Description:

It would be helpful if there were some documentation on how to query the
dictionaries themselves, to get a canonical root word, either:

1. Directly, such as:
SELECT words FROM english_stem WHERE stem = 'chlorin';
-- should return e.g. "chlorine", "chlorination", "chlorinated"
-- there isn't any documentation on how to actually do this.

2. Indirectly, such as:
SELECT ts_unlexize('english_stem','chlorin');
-- this is a function which doesn't yet seem to exist: the one-to-many
-- inverse of ts_lexize().

3. Or, the canonical version of (2):
SELECT ts_canonical('english_stem','chlorin');
-- a one-to-one function to find the English root word (not the lexeme).
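
As a possible workaround for (1) and (2), a reverse lookup table can be built
by running ts_lexize() over a word list. This is only a sketch: the tables
word_list and stem_words are hypothetical names, and the word list has to be
supplied from outside PostgreSQL, because the Snowball stemmer behind
english_stem is algorithmic and keeps no stored list of words to query.

```sql
-- Hypothetical word list; in practice it might be loaded from a
-- dictionary file rather than typed in by hand.
CREATE TABLE word_list (word text PRIMARY KEY);

INSERT INTO word_list (word) VALUES
    ('chlorine'), ('chlorination'), ('chlorinated'),
    ('kitten'), ('kittens'), ('temperature'), ('something');

-- Hypothetical reverse-lookup table: one row per (stem, word) pair.
-- ts_lexize() returns an array of lexemes; stop words give an empty
-- array, so their first element is NULL and they are filtered out.
CREATE TABLE stem_words AS
SELECT stem, word
FROM (
    SELECT (ts_lexize('english_stem', word))[1] AS stem, word
    FROM word_list
) AS s
WHERE stem IS NOT NULL;

CREATE INDEX stem_words_stem_idx ON stem_words (stem);

-- One-to-many lookup: every known word that stems to 'chlorin'.
SELECT word FROM stem_words WHERE stem = 'chlorin' ORDER BY word;
```

The lookup can only return words that are present in word_list, so its
coverage depends entirely on the quality of that list.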

An example of where this is useful: consider a list of documents, containing
a large amount of English text.
For this example, consider that the following words are frequent: "the",
"kitten", "kittens", "chlorination", "chlorinated", "temperature" and
"something".

We wish to display a "tag cloud" of the most common terms, excluding
stopwords, by means of ts_stat().
At the moment, it lists:
"kitten" -- correctly treating "kitten" and "kittens" as the same.
"chlorin" -- correctly merging "chlorination" and "chlorinated",
but creating a non-word.
"temperatur" -- right stem, not a word.
"someth" -- mistaken stemming: the -ing suffix has been removed.

So, given the array ["kitten","chlorin","temperatur","someth"], we wish to
un-stem each entry to find the first valid English word with that stem,
i.e.
["kitten", "chlorine", "temperature", "something"]
Note that it is intentional to retrieve "chlorine" even though the original
inputs were "chlorinated" and "chlorination", and did not necessarily
contain "chlorine".
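
For illustration, the one-to-one un-stemming described above might be
approximated as follows. This is a sketch under assumptions, not an existing
feature: the candidate word list is hypothetical and supplied inline, and
"shortest word, then alphabetical" is just one possible heuristic for picking
the canonical form. In real use the stems would come from the word column
returned by ts_stat() rather than from a literal array.

```sql
WITH candidates(word) AS (
    -- Hypothetical list of valid English words to choose from.
    VALUES ('kitten'), ('kittens'),
           ('chlorine'), ('chlorinated'), ('chlorination'),
           ('temperature'), ('something')
),
stemmed AS (
    -- Stem each candidate with the same dictionary used for indexing.
    SELECT (ts_lexize('english_stem', word))[1] AS stem, word
    FROM candidates
)
-- For each stem keep a single candidate: shortest first, ties broken
-- alphabetically. Stems with no candidate fall back to the stem itself.
SELECT DISTINCT ON (t.stem)
       t.stem,
       COALESCE(s.word, t.stem) AS display_word
FROM unnest(ARRAY['kitten', 'chlorin', 'temperatur', 'someth']) AS t(stem)
LEFT JOIN stemmed AS s ON s.stem = t.stem
ORDER BY t.stem, length(s.word), s.word;
```

With this heuristic "chlorine" is preferred over "chlorinated" and
"chlorination" simply because it is the shortest candidate sharing the stem.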

There doesn't seem to be any process for doing this. Not sure whether this
is just something for the documentation, or an RFE for (2). Thanks very
much.
