From: | Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> |
---|---|
To: | |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: FTS uses "tsquery" directly in the query |
Date: | 2010-01-25 16:35:47 |
Message-ID: | 20100125173547.47d8d7a3@dawn.webthatworks.it |
Lists: | pgsql-general |
On Mon, 25 Jan 2010 07:19:59 -0800 (PST)
xu fei <autofei(at)yahoo(dot)com> wrote:
> Hi, Oleg Bartunov:
> First, thanks for your quick reply. Could you explain a little
> more about "it's general limitation/feature"? I am just confused, since
> to my understanding the to_tsquery('item') function returns a tsquery
> type, the same as 'item'::tsquery. Let me explain what I want.
>
> First step: extract the top K tokens.
> I have a table with a column of type tsvector. Some records in this
> column are too big and contain hundreds of tokens. I just want the
> top K tokens by frequency, for example the top 5. I am not sure there
> is a direct way to get such top K tokens, so I just read them out in
> Java, count the frequency of each token, and sort them.
>
> Second step: generate the query.
> Now I will use these tokens to construct a query to search other
> vectors in the same table. I cannot use to_tsquery() directly, for
> two reasons: 1) The default logical operator in to_tsquery() is "&",
> but what I need is "|". 2) Since the tokens come from a tsvector, they
> are already normalized. If I use to_tsquery() again, they will be
> normalized again! For example, "course" -> "cours" -> "cour". So I
> just concatenate the top K tokens with "|" and directly use
> "::tsquery".
>
> Unfortunately, as you say "it's general limitation/feature", I
> cannot do that. I checked your manual "Full-Text Search in PostgreSQL
> A Gentle Introduction", but could not figure out how. So is it
> possible to implement what I want in FTS? If so, how? Thanks! Xu
> --- On Sun, 1/24/10, Oleg Bartunov
You're trying to solve a problem similar to mine.
I'd like to build a naive similar-text search.
I don't have the "length" problem, but I'd still like to avoid
tokenizing/lexizing a text twice to build a tsquery.
I have weighted tsvectors stored in a column, and once I pick one
I'd like to look for similar ones in the same column.
There are thousands of ways to measure text similarity (and Oleg pointed
me to some), but ts_rank should be "good enough for me".
My texts are very short, so I can't use & on the whole tsvector;
otherwise chances are very high that I'd find just one match.
As you suggested, I could pick a subset of "important"[1] lexemes
from the tsvector and build an "&"ed tsquery with them.
But at least in my case, since I'm dealing with very short texts,
this still looks too risky (just 1 match). Considering that I'm
using weighted tsvectors, it seems that using "|" and picking the rows
with the best rank could be a way to go.
But as you've noted, there is no function that turns a tsvector into a
tsquery (possibly including weights) and gives you the choice to use
"|".
Well... I'm trying to write a couple of helper functions in C.
But I'm pretty new to postgres internals, and I miss a reference
of functions/macros with some examples... and this is a side project,
and I haven't used C for quite a while.
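
Until such a helper exists, the same idea can be sketched on the client
side. The following is only a minimal illustration in Python, not the C
function mentioned above: the names quote_lexeme and or_tsquery are
hypothetical, and it assumes the lexemes (with optional weight letters)
have already been read out of the tsvector by the application.

```python
from typing import Iterable, Tuple


def quote_lexeme(lexeme: str) -> str:
    """Quote one lexeme for tsquery text input.

    Embedded backslashes and single quotes are doubled, then the
    lexeme is wrapped in single quotes.
    """
    escaped = lexeme.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def or_tsquery(lexemes: Iterable[Tuple[str, str]]) -> str:
    """Join already-normalized lexemes into an OR'ed tsquery string.

    Each item is (lexeme, weights), where weights is a string such as
    "A" or "AB", or "" for no weight restriction.
    """
    parts = []
    for lexeme, weights in lexemes:
        term = quote_lexeme(lexeme)
        if weights:
            # In a tsquery, a weight label restricts the match to
            # lexemes carrying one of those weights.
            term += ":" + weights
        parts.append(term)
    return " | ".join(parts)


print(or_tsquery([("cours", "A"), ("postgr", "")]))
```

The resulting string would be passed as a query parameter and cast with
::tsquery rather than fed to to_tsquery(), so the lexemes are not
normalized a second time.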
Once I have that function, I'll have to work out how to return only a few
rows (since I'll have to use |, I expect a lot of returned rows), to
make efficient use of the GIN index and avoid computing ts_rank for
too many rows.
Don't hold your breath waiting... but let me know if you're
interested, so I don't have to be the only one posting newbie
questions on pgsql-hackers ;)
[1] ts_stat could give you some hints about which lexemes may be
important... but deciding what's important is another can of
worms... and as I said, ts_rank should be "good enough for me".
--
Ivan Sergio Borgonovo
http://www.webthatworks.it