From: | Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | ranking how "similar" are tsvectors was: OR tsquery |
Date: | 2010-01-17 16:56:24 |
Message-ID: | 20100117175624.315cfa55@dawn.webthatworks.it |
Lists: | pgsql-general |
My initial request was about a way to build a tsquery similar to
what plainto_tsquery produces, but using | instead of & as the
glue.
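
To make the request concrete, here is a rough client-side sketch in Python. It only builds the input string; the result would still have to go through to_tsquery so that the words get normalised. The names quote_lexeme and or_tsquery_text are made up for this example, and the escaping rule (doubling ' and \ inside a quoted word) is my reading of the tsvector/tsquery input syntax, so treat it as an assumption.

```python
def quote_lexeme(word: str) -> str:
    """Quote one word for tsquery input.

    Assumption: inside a quoted operand, single quotes and
    backslashes are escaped by doubling them.
    """
    escaped = word.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def or_tsquery_text(text: str) -> str:
    """Split text on whitespace and join the quoted words with ' | '."""
    return " | ".join(quote_lexeme(word) for word in text.split())


if __name__ == "__main__":
    # The printed string is meant to be passed as the query argument
    # of to_tsquery, e.g. as a bound parameter.
    print(or_tsquery_text("Joe's fat cat"))
```

Splitting on whitespace is much cruder than what the text search parser does, so this is only a stand-in for a real "plainto_tsquery with |".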
But at the end of the day I'd like to find similar tsvectors and
rank them.
I have a table with several fields that together build up a
weighted tsvector.
I'd like to pick a tsvector and find the N most similar ones.
I've found this:
http://domas.monkus.lt/document-similarity-postgresql
That's not really too far from what I was trying to do.
But I have precomputed tsvectors (I think turning text into a
tsvector should be a more expensive operation than string
replacement) and I'd like to preserve the weights.
I'm not really sure, but I think a lexeme can actually contain a '
or a space (depending on the stemmer/parser?), so I'd have to take
care of escaping, etc.
Since there is no direct access to the elements of a tsvector, the
only "correct" way I see to build the query would be to rebuild the
tsvector manually and get the result back as records using
ts_debug and ts_lexize, which looks like a bit of a PITA.
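
One workaround that avoids re-parsing the original text would be to read the text output of the stored tsvector on the client and turn it into an OR query there. The Python below is only a sketch under assumptions: it expects well-formed output of the form 'lexeme':1A,2 'other':3, with ' and \ doubled inside lexemes, and the function names parse_tsvector and or_query_from_tsvector are invented for the example.

```python
from typing import Dict, List, Tuple


def parse_tsvector(text: str) -> Dict[str, List[Tuple[int, str]]]:
    """Parse tsvector text output into {lexeme: [(position, weight), ...]}.

    Assumes every lexeme is single-quoted and the input is well formed.
    Positions without a weight letter get the default weight 'D'.
    """
    result: Dict[str, List[Tuple[int, str]]] = {}
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1
            continue
        if text[i] != "'":
            raise ValueError(f"expected a quote at offset {i}")
        i += 1
        chars: List[str] = []
        while True:
            if i >= n:
                raise ValueError("unterminated lexeme")
            c = text[i]
            if c in "'\\" and i + 1 < n and text[i + 1] == c:
                chars.append(c)  # doubled quote or backslash
                i += 2
            elif c == "'":
                i += 1  # closing quote
                break
            else:
                chars.append(c)
                i += 1
        positions: List[Tuple[int, str]] = []
        if i < n and text[i] == ":":
            j = i + 1
            while j < n and not text[j].isspace():
                j += 1
            for item in text[i + 1:j].split(","):
                weight = item[-1] if item[-1] in "ABCD" else "D"
                positions.append((int(item.rstrip("ABCD")), weight))
            i = j
        result["".join(chars)] = positions
    return result


def or_query_from_tsvector(text: str) -> str:
    """Build tsquery text that ORs all lexemes, keeping their weight letters."""
    terms = []
    for lexeme, positions in parse_tsvector(text).items():
        quoted = "'" + lexeme.replace("\\", "\\\\").replace("'", "''") + "'"
        weights = "".join(sorted({w for _, w in positions}))
        terms.append(quoted + (":" + weights if weights else ""))
    return " | ".join(terms)
```

Since the lexemes are already normalised, the resulting string would be cast to tsquery directly rather than passed through to_tsquery. Keeping the weight letters restricts each term to matches carrying one of those weights, which may be stricter than what is wanted for a similarity search.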
I don't even think that having direct access to the elements of a
tsvector would completely solve the problem, since tsvectors store
positions too, but it would be a step forward in making it easier
to compare documents and find similar ones.
An operator that checks the intersection of tsvectors would come
in handy.
Adding a ts_rank(tsvector, tsvector) would surely help too.
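
Just to illustrate the kind of score I have in mind, here is a small Python sketch of a weighted overlap between two lexeme sets. It is not how ts_rank works; it is a weighted Jaccard measure I picked for the example, it ignores positions entirely, and the names lexeme_weight, similarity and top_n are hypothetical. The numeric values are the default ts_rank weights for D, C, B and A (0.1, 0.2, 0.4, 1.0).

```python
from typing import Dict, Iterable, List, Tuple

# Default ts_rank weights, keyed by weight letter.
WEIGHT_VALUE = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}


def lexeme_weight(letters: Iterable[str]) -> float:
    """Highest weight among the letters; 'D' if no letter is given."""
    return max((WEIGHT_VALUE[w] for w in letters), default=WEIGHT_VALUE["D"])


def similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Weighted Jaccard score in [0, 1] between two {lexeme: letters} maps."""
    wa = {k: lexeme_weight(v) for k, v in a.items()}
    wb = {k: lexeme_weight(v) for k, v in b.items()}
    keys = wa.keys() | wb.keys()
    if not keys:
        return 0.0
    shared = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    total = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return shared / total


def top_n(target: Dict[str, str],
          docs: Dict[str, Dict[str, str]],
          n: int) -> List[Tuple[str, float]]:
    """Return the n document ids most similar to target, best first."""
    scored = [(doc_id, similarity(target, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

Doing this on the client means fetching every candidate tsvector, which is exactly why an operator or ranking function inside the database would be preferable.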
thanks
--
Ivan Sergio Borgonovo
http://www.webthatworks.it