From: | Sushant Sinha <sushant354(at)gmail(dot)com> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [GENERAL] Fragments in tsearch2 headline |
Date: | 2008-06-21 14:00:53 |
Message-ID: | 1214056853.8689.10.camel@dragflick |
Lists: | pgsql-general pgsql-hackers |
I have attached an updated patch with the following changes:
1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norm (or lexeme) for headline marking
4. Removes ts_rank.h
5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
only counts the actual words and excludes spaces etc.
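
To illustrate change 5, here is a minimal sketch in Python rather than the patch's C. The `(text, is_word)` token representation and the name `count_headline_words` are assumptions made for illustration only; they are not the structures used in the patch.

```python
from typing import List, Tuple

# Hypothetical token representation: (text, is_word).
# In the real code the parser's token type decides this; here it is a plain flag.
Token = Tuple[str, bool]


def count_headline_words(tokens: List[Token]) -> int:
    """Count only real words, skipping spaces, punctuation and other non-word tokens."""
    return sum(1 for _text, is_word in tokens if is_word)


tokens = [("full", True), (" ", False), ("text", True), (",", False),
          (" ", False), ("search", True)]
print(count_headline_words(tokens))  # 3 words, although there are 6 tokens
```

Counting this way means a word limit such as MaxWords is measured in words the reader actually sees, not in parser tokens.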
I have also changed NumFragments option to MaxFragments as there may not
be enough covers to display NumFragments.
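
A rough sketch of why the option is an upper bound, in Python and not taken from the patch: the `(start, end)` cover representation, the name `pick_fragments`, and the shortest-first ordering are all assumptions for illustration.

```python
from typing import List, Tuple

# Hypothetical cover representation: (start, end) word positions, inclusive.
Cover = Tuple[int, int]


def pick_fragments(covers: List[Cover], max_fragments: int) -> List[Cover]:
    """Return at most max_fragments covers, shortest first."""
    by_length = sorted(covers, key=lambda c: c[1] - c[0])
    return by_length[:max_fragments]


covers = [(10, 30), (42, 47)]
# Only two covers exist, so asking for three fragments still yields two.
print(pick_fragments(covers, 3))
```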
Another change that I was thinking about:
Right now if cover size > max_words then I just cut the trailing words.
Instead I was thinking that we should split the cover into more
fragments such that each fragment contains a few query words. Then each
fragment will not contain all query words but will show more occurrences
of query words in the headline. I would like to know what your opinion
on this is.
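
To make the proposal concrete, here is a minimal sketch in Python, not the patch's C. The name `split_cover`, the word-list representation of a cover, and the rule that every fragment starts at a query word are assumptions chosen for illustration; the actual splitting rule is exactly what is open for discussion.

```python
from typing import List, Set


def split_cover(words: List[str], query: Set[str], max_words: int) -> List[List[str]]:
    """Split a cover longer than max_words into fragments of at most
    max_words words, each starting at a query word."""
    if len(words) <= max_words:
        return [words]
    fragments = []
    i = 0
    while i < len(words):
        # Skip ahead to the next query word; a fragment always starts on one.
        while i < len(words) and words[i].lower() not in query:
            i += 1
        if i == len(words):
            break
        fragments.append(words[i:i + max_words])
        i += max_words
    return fragments


cover = "cat sat on the mat while the dog slept near the cat".split()
for fragment in split_cover(cover, {"cat", "dog"}, max_words=4):
    print(" ".join(fragment))
```

With this input, simply cutting the trailing words would show only the first four words and the single query word "cat". The split version shows both "cat" and "dog", at the cost of no fragment containing all query words. The last fragment here is a single word, so a real implementation would also need to respect MinWords.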
-Sushant.
On Thu, 2008-06-05 at 20:21 +0400, Teodor Sigaev wrote:
> > A couple of caveats:
> >
> > 1. ts_headline testing was done with current cvs head where as
> > headline_with_fragments was done with postgres 8.3.1.
> > 2. For headline_with_fragments, TSVector for the document was obtained
> > by joining with another table.
> > Are these differences understandable?
>
> That is possible situation because ts_headline has several criterias of 'best'
> covers - length, number of words from query, good words at the begin and at the
> end of headline while your fragment's algorithm takes care only on total number
> of words in all covers. It's not very good, but it's acceptable, I think.
> Headline (and ranking too) hasn't any formal rules to define is it good or bad?
> Just a people's opinions.
>
> Next possible reason: original algorithm had a look on all covers trying to find
> the best one while your algorithm tries to find just the shortest covers to fill
> a headline.
>
> But it's very desirable to use ShortWord - it's not very comfortable for user if
> one option produces unobvious side effect with another one.
>
>
> > If you think these caveats are the reasons or there is something I am
> > missing, then I can repeat the entire experiments with exactly the same
> > conditions.
>
> Interesting for me test is a comparing hlCover with Cover in your patch, i.e.
> develop a patch which uses hlCover instead of Cover and compare old patch with
> new one.
Attachment | Content-Type | Size |
---|---|---|
headlines_v0.5.patch | text/x-patch | 11.0 KB |