ranking how "similar" are tsvectors was: OR tsquery

From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: ranking how "similar" are tsvectors was: OR tsquery
Date: 2010-01-17 16:56:24
Message-ID: 20100117175624.315cfa55@dawn.webthatworks.it
Lists: pgsql-general

My initial request was for a way to build a tsquery similar to what
plainto_tsquery produces, but using | instead of & as the glue.

But at the end of the day I'd like to find similar tsvectors and
rank them.

I have a table with several fields that together build up a
weighted tsvector.

I'd like to pick a tsvector and find the N most similar ones.

I've found this:

http://domas.monkus.lt/document-similarity-postgresql

That's not really too far from what I was trying to do.

But I have precomputed tsvectors (I think turning text into a
tsvector is a more expensive operation than string replacement)
and I'd like to preserve the weights.

I'm not really sure but I think a lexeme can actually contain a '
or a space (depending on stemmer/parser?), so I'd have to take care
of escaping etc...
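
To make the escaping concern concrete, here is a minimal sketch in
Python of gluing already-normalized lexemes together with | while
keeping their weight labels. It assumes the usual quoting rule for
lexemes in tsvector/tsquery text (wrap in single quotes, double any
embedded quote or backslash). The helper names quote_lexeme and
or_tsquery are made up for illustration; they are not part of
PostgreSQL or of any driver.

```python
from typing import Iterable, Tuple


def quote_lexeme(lexeme: str) -> str:
    """Quote one lexeme for use in tsquery text.

    Assumption: embedded single quotes and backslashes are doubled,
    and the whole lexeme is wrapped in single quotes.
    """
    escaped = lexeme.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def or_tsquery(lexemes: Iterable[Tuple[str, str]]) -> str:
    """Join (lexeme, weights) pairs with | instead of &.

    `weights` is a string such as "A" or "AB"; an empty string
    means the term carries no weight restriction.
    """
    parts = []
    for lexeme, weights in lexemes:
        term = quote_lexeme(lexeme)
        if weights:
            term += ":" + weights
        parts.append(term)
    return " | ".join(parts)


if __name__ == "__main__":
    print(or_tsquery([("cat", ""), ("joe's", "A"), ("fat rat", "AB")]))
```

The resulting string is meant to be sent as a bound parameter and
cast to tsquery on the server side, so that the lexemes are not
normalized a second time.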

Since there is no direct access to the elements of a tsvector, the
only "correct" way I see to build the query would be to manually
rebuild the tsvector and get the result back as a record using
ts_debug and ts_lexize... which looks like a bit of a PITA.
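
One possible workaround, sketched here in Python, is to read the
tsvector in its text form on the client and split it there. This
is only a sketch under assumptions: the value is assumed to look
like 'cat':3 'fat':2A,4B, with every lexeme single-quoted, embedded
quotes and backslashes doubled, and a missing weight letter meaning
D. The function name parse_tsvector_text is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# One or more "position[weight]" items separated by commas.
_POSITIONS = re.compile(r"\d+[A-Da-d]?(?:,\d+[A-Da-d]?)*")


def parse_tsvector_text(text: str) -> Dict[str, List[Tuple[int, str]]]:
    """Map each lexeme to its list of (position, weight) pairs."""
    result: Dict[str, List[Tuple[int, str]]] = {}
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1
            continue
        if text[i] != "'":
            raise ValueError("expected a quoted lexeme at offset %d" % i)
        i += 1
        chars = []
        while True:
            if i >= n:
                raise ValueError("unterminated lexeme")
            c = text[i]
            if c == "'":
                if i + 1 < n and text[i + 1] == "'":
                    chars.append("'")  # doubled quote -> literal quote
                    i += 2
                    continue
                i += 1  # closing quote
                break
            if c == "\\":
                if i + 1 >= n:
                    raise ValueError("dangling backslash")
                chars.append(text[i + 1])  # escaped character
                i += 2
                continue
            chars.append(c)
            i += 1
        positions: List[Tuple[int, str]] = []
        if i < n and text[i] == ":":
            match = _POSITIONS.match(text, i + 1)
            if match is None:
                raise ValueError("bad position list at offset %d" % i)
            for item in match.group().split(","):
                if item[-1].isalpha():
                    positions.append((int(item[:-1]), item[-1].upper()))
                else:
                    positions.append((int(item), "D"))  # default weight
            i = match.end()
        result["".join(chars)] = positions
    return result


if __name__ == "__main__":
    print(parse_tsvector_text("'cat':3 'fat':2A,4B 'joe''s'"))
```

This avoids re-running the parser and dictionaries on the original
text, at the cost of shipping the whole vector to the client.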

I don't even think that having direct access to the elements of a
tsvector would completely solve the problem, since tsvectors store
positions too, but it would be a step forward in making it easier
to compare documents to find similar ones.
An operator that checks the intersection of tsvectors would come
in handy.
Adding a ts_rank(tsvector, tsvector) would surely help too.
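
To illustrate what such an intersection-based ranking could look
like, here is a small client-side sketch in Python. It is not how
ts_rank works internally: it ignores positions entirely and
computes a weighted Jaccard overlap, which is a stand-in technique
chosen for illustration. Each vector is assumed to be reduced to a
dict mapping a lexeme to its best weight letter, and the numeric
values borrow the default ts_rank weights for D, C, B and A. The
function names are hypothetical.

```python
from typing import Dict, Set

# Numeric value per weight label (the ts_rank defaults for D, C, B, A).
WEIGHTS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}


def common_lexemes(a: Dict[str, str], b: Dict[str, str]) -> Set[str]:
    """Lexemes present in both vectors (the 'intersection')."""
    return set(a) & set(b)


def _value(vector: Dict[str, str], lexeme: str) -> float:
    """Numeric weight of a lexeme, or 0.0 if the vector lacks it."""
    return WEIGHTS[vector[lexeme]] if lexeme in vector else 0.0


def similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Weighted Jaccard overlap in [0, 1]; positions are ignored."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    shared = sum(min(_value(a, x), _value(b, x)) for x in union)
    total = sum(max(_value(a, x), _value(b, x)) for x in union)
    return shared / total


if __name__ == "__main__":
    doc1 = {"cat": "A", "fat": "D"}
    doc2 = {"cat": "B", "rat": "D"}
    print(common_lexemes(doc1, doc2), similarity(doc1, doc2))
```

Sorting candidate rows by this score in descending order and
keeping the first N would give the "N most similar" list, though
for a large table the candidates would first have to be narrowed
down by an indexed query.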

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
