From: | Dominik Giger <dominik(dot)giger(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector |
Date: | 2020-01-28 10:50:20 |
Message-ID: | CAGFNN0Y1KP_tjeAvaHqYr6fR3kEngbQeAyFaj7wF+1NaUEUAqw@mail.gmail.com |
Lists: | pgsql-bugs |
On Mon, Jan 27, 2020 at 11:35 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> > The following query shows the problem:
>
> > select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as
> > rank_correct
> > from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
> > setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
> > setweight(to_tsvector('simple', 'foo something'), 'A') as
> > doc2,
> > to_tsquery('simple', 'foo:* & something') as
> > query) as subquery;
>
> > ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
> > consider the 'foobar' term with lower weight when calculating the rank. The
> > foo:1A is only considered in doc2.
>
> No, that's not correct. What it actually is doing is taking some sort of
> average of the weights of the occurrences, as you can see if you play
> around with a few more examples besides these two. That could be better
> documented, perhaps, but I don't think it's obviously broken.
>
> I can see that there might be a use for taking the max or even the sum
> of the weights rather than an average --- in many situations it wouldn't
> be desirable to rank doc1 of your example lower than doc2. But really
> that'd be a different ranking algorithm, not a bug fix for this one.
>
> The manual claims you can write your own ranking algorithm ... but
> AFAICS you'd have to code it in C, because we aren't exposing anything
> at SQL level that would let you get at the raw match data :-(.
> So there's room for improvement there.
>
> Also, you might try using ts_rank_cd() instead, as that uses a different
> algorithm for combining the weights. At least on this example, doc1
> gets a higher score than doc2.
>
> regards, tom lane
I see, thank you for the explanation.
Maybe I can add another reason why I think it might be a bug. Consider
this query:
select ts_rank(doc1, query) as rank_wrong,
       ts_rank(doc2, query) as rank_correct
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:*') as query) as subquery;
The only change here is that I removed the '& something' part of the query.
Now the ranking is what one would expect: doc1 ranks higher than doc2.
I am unsure why adding a second search term (which is contained in
both documents) would lead to a change in the ranking order.
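To make the comparison easier to reproduce, the two queries can be run side by side against the same pair of documents. The following is only a sketch: the aliases d and q, the label column, and the rank_* column names are illustrative, and ts_rank_cd() is included because it was suggested in the quoted reply. No particular output values are claimed.

```sql
-- Compare ts_rank and ts_rank_cd for both query variants on doc1 and doc2.
-- doc1: 'foo something' at weight A plus 'foobar' at weight C
-- doc2: 'foo something' at weight A only
select q.label,
       ts_rank(d.doc1, q.query)    as rank_doc1,
       ts_rank(d.doc2, q.query)    as rank_doc2,
       ts_rank_cd(d.doc1, q.query) as rank_cd_doc1,
       ts_rank_cd(d.doc2, q.query) as rank_cd_doc2
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2) as d
cross join (values
    ('foo:*',             to_tsquery('simple', 'foo:*')),
    ('foo:* & something', to_tsquery('simple', 'foo:* & something'))
) as q(label, query);
```

Each row shows one query variant, so any flip in the relative order of rank_doc1 and rank_doc2 between the two rows is visible directly.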
What do you think?
Regards,
Dominik Giger