From: | Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Cc: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
Subject: | Re: ranking how "similar" are tsvectors was: OR tsquery |
Date: | 2010-01-17 18:27:59 |
Message-ID: | 20100117192759.4b063416@dawn.webthatworks.it |
Lists: | pgsql-general |
On Sun, 17 Jan 2010 20:19:59 +0300 (MSK)
Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> Ivan,
>
> You can write a function to get lexemes from a tsvector:
> CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text,
>   OUT word text, OUT ndoc integer, OUT nentry integer)
> RETURNS SETOF record AS
> $$
>   SELECT ts_stat('SELECT ' || quote_literal( $1::text ) || '::tsvector',
>                  quote_literal( $2::text) );
> $$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;
Thanks very much Oleg.
Still, it doesn't really make the pain go away.
I have weights stored in my tsvectors and I need to build the query
using them.
This means that if I have:
'aubergine':4A 'orange':1B 'banana':5A 'apple':3C
and
'coconut':3B 'bananas':1A 'tomatoes':2C
stored in a column (tsv), I would really like to build up the query:
to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')
then
tsv
@@
to_tsquery('aubergine:A | orange:B | bananas:A | apple:C')
and the corresponding ts_rank().
I'm aware that it is not symmetrical, but it looks like the cheapest
and fastest thing I can do right now.
I'm using pg_catalog.english. Am I right in assuming that NO
lexeme will contain spaces?
If that is the case I could simply use string manipulation tools.
Not pretty, but it will work.
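
A minimal sketch of that string-manipulation route, done client-side in Python. It assumes the input is the text form of a tsvector (as returned by tsv::text) and that the result is cast with ::tsquery rather than passed through to_tsquery, since the lexemes are already normalized. The function name tsvector_to_or_query is hypothetical, and the regex keeps each lexeme inside its quotes, so it should not depend on lexemes being free of spaces:

```python
import re

# One tsvector entry in text form, e.g. 'banana':1A,5A
# Group 1: the lexeme (a doubled '' is an escaped quote).
# Group 2: the optional comma-separated positions with weight labels.
ENTRY = re.compile(r"'((?:[^']|'')*)'(?::([0-9A-D,]+))?")


def tsvector_to_or_query(tsv_text: str) -> str:
    """Build a weighted OR query string from a tsvector's text form."""
    terms = []
    for match in ENTRY.finditer(tsv_text):
        lexeme = match.group(1)
        positions = match.group(2) or ""
        # Collect the distinct weight labels of this lexeme; a position
        # printed without a label carries the default weight D.
        weights = sorted(
            {p[-1] if p[-1] in "ABCD" else "D" for p in positions.split(",") if p}
        )
        label = "".join(weights)
        # Keep the lexeme quoted exactly as it appeared in the tsvector.
        terms.append(f"'{lexeme}'" + (f":{label}" if label else ""))
    return " | ".join(terms)


print(tsvector_to_or_query("'aubergine':4A 'orange':1B 'banana':5A 'apple':3C"))
# prints: 'aubergine':A | 'orange':B | 'banana':A | 'apple':C
```

The resulting string could then be bound as a parameter and used as tsv @@ $1::tsquery together with ts_rank(tsv, $1::tsquery).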
> Then you can create an ARRAY like:
>
> select ARRAY ( select (ts_stat(fts,'*')).word from papers where
> id=2);
>
> Then you will have two arrays and you're free to apply any
> similarity function (cosine, Jaccard, ...) to calculate what
> you want. If you want to preserve weights, then use a weight label
> instead of '*'.
What ts_rank does is more than enough right now.
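
For reference, a rough sketch of the array-based comparison Oleg describes, assuming the two lexeme arrays have already been fetched into Python lists. The function names and the sample lexemes are hypothetical, and this ignores weights entirely:

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity: shared distinct lexemes over all distinct lexemes."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of the two lexeme-count vectors."""
    ca, cb = Counter(a), Counter(b)
    # Only lexemes present in both lists contribute to the dot product.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical lexeme lists standing in for the two arrays.
doc1 = ["aubergin", "orang", "banana", "appl"]
doc2 = ["coconut", "banana", "tomato"]
print(jaccard(doc1, doc2), cosine(doc1, doc2))
```

Unlike the query-based approach, both measures are symmetrical in their arguments.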
> Another idea is to use array_agg, but I'm not ready to discuss it.
>
> Please, keep in mind, that document similarity is a hot topic in
Not hard to imagine.
> IR, and, yes, Teodor and I have something on this, but the code
> isn't publicly available. Unfortunately, we had no sponsor for
> full-text search last year and I see no prospects for this
> year, so we have postponed our text-search development.
Good luck. Do you have anything like http://www.chipin.com/ for
small donations?
--
Ivan Sergio Borgonovo
http://www.webthatworks.it