Re: ranking how "similar" are tsvectors was: OR tsquery

From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: ranking how "similar" are tsvectors was: OR tsquery
Date: 2010-01-17 18:27:59
Message-ID: 20100117192759.4b063416@dawn.webthatworks.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sun, 17 Jan 2010 20:19:59 +0300 (MSK)
Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:

> Ivan,
>
> You can write function to get lexemes from tsvector:

> CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text, OUT
> word text, OUT ndoc integer, OUT nentry integer)
> RETURNS SETOF record AS
> $$
> SELECT ts_stat('SELECT ' || quote_literal( $1::text ) ||
> '::tsvector', quote_literal( $2::text) ); $$ LANGUAGE SQL RETURNS
> NULL ON NULL INPUT IMMUTABLE;

Thanks very much Oleg.

Still it is not really making the pain go away.
I've weights stored in my tsvector and I need to build the query
using them.

This means that if I have:
'aubergine':4A 'orange':1B 'banana':5A 'apple':3C
and
'coconut':3B 'bananas':1A 'tomatoes:2C
stored in a column (tsv) I really would like to build up the query:

to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')

then

tsv
@@
to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')

and relative ts_rank()

I'm aware that it is not symmetrical, but it looks as the cheapest
and fastest thing I can do right now.

I'm using pg_catalog.english. Am I supposing correctly that NO
lexeme will contain spaces?

If that is the case I could simply use string manipulation tools.
Not nice to see but it will work.

> Then, you can create ARRAY like:
>
> select ARRAY ( select (ts_stat(fts,'*')).word from papers where
> id=2);
>
> Then, you will have two arrays and you're free to apply any
> similarity function (cosine, jaccard,....) to calculate what do
> you want. If you want to preserve weights, then use weight label
> instead of '*'.

What ts_rank does is more than enough right now.

> Another idea is to use array_agg, but I'm not ready to discuss it.
>
> Please, keep in mind, that document similarity is a hot topic in

Not hard to imagine.

> IR, and, yes, I and Teodor have something about this, but code
> isn't available for public. Unfortunately, we had no sponsor for
> full-text search for last year and I see no perspectives this
> year, so we postpone our text-search development.

Good luck. Do you have anything like http://www.chipin.com/ for
small donations?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Jeff Davis 2010-01-17 19:06:19 Re: Constraint exclusion issue
Previous Message Andy Colson 2010-01-17 18:14:08 Re: Data Generators