From: | Sushant Sinha <sushant354(at)gmail(dot)com> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Pierre-Yves Strub <pierre(dot)yves(dot)strub(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [GENERAL] Fragments in tsearch2 headline |
Date: | 2008-05-31 23:58:41 |
Message-ID: | 1212278321.5891.24.camel@dragflick |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-hackers |
I have attached a new patch with respect to the current cvs head. This
produces headline in a document for a given query. Basically it
identifies fragments of text that contain the query and displays them.
DESCRIPTION
HeadlineParsedText contains an array of actual words but not
information about the norms. We need an indexed position vector for each
norm so that we can quickly evaluate a number of possible fragments.
Something that tsvector provides.
So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The
position information of the norms corresponds to the position of words
in HeadlineParsedText (not to the norms positions as is the case in
tsvector). This works correctly with the current parser. If you think
there may be issues with other parsers please let me know.
This approach does not change any other interface and fits nicely with
the overall framework.
The norms are converted into tsvector and a number of covers are
generated. The best covers are then chosen to be in the headline. The
covers are separated using a hardcoded coversep. Let me know if you want
to expose this as an option.
Covers that overlap with already chosen covers are excluded.
Some options like ShortWord and MinWords are not taken care of right
now. MaxWords are used as maxcoversize. Let me know if you would like to
see other options for fragment generation as well.
Let me know any more changes you would like to see.
-Sushant.
On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote:
> Hi!
>
> > 1. Why is hlparsetext used to parse the document rather than the
> > parsetext function? Since words to be included in the headline will be
> > marked afterwords, it seems more reasonable to just use the parsetext
> > function.
> > The main difference I see is the use of hlfinditem and marking whether
> > some word is repeated.
> hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext
> doesn't.
> hlparsetext preserves original form of lexemes. parsetext doesn't.
>
> >
> > The reason this is important is that hlparsetext does not seem to be
> > storing word positions which parsetext does. The word positions are
> > important for generating headline with fragments.
> Doesn't needed - hlparsetext preserves the whole text, so, position is a number
> of array.
>
> >
> > 2.
> >> I would prefer the signature ts_headline( [regconfig,] text, tsquery
> >> [,text] )and function should accept 'NumFragments=>N' for default
> >> parser. Another parsers may use another options.
> >
> > Does this mean we want a unified function ts_headline and we trigger the
> > fragments if NumFragments is specified?
>
> Trigger should be inside parser-specific function (pg_ts_parser.prsheadline).
> Another parsers might not recognize that option.
>
> > It seems that introducing a new
> > function which can take configuration OID, or name is complex as there
> > are so many functions handling these issues in wparser.c.
> No, of course - ts_headline takes care about finding configuration and calling
> correct parser.
>
> >
> > If this is true then we need to just add marking of headline words in
> > prsd_headline. Otherwise we will need another prsd_headline_with_covers
> > function.
> Yeah, pg_ts_parser.prsheadline should mark the lexemes to. It even can change
> an array of HeadlineParsedText.
>
> >
> > 3. In many cases people may already have TSVector for a given document
> > (for search operation). Would it be faster to pass TSVector to headline
> > function when compared to computing TSVector each time? If that is the
> > case then should we have an option to pass TSVector to headline
> > function?
> As I mentioned above, tsvector doesn;t contain whole information about text.
>
Attachment | Content-Type | Size |
---|---|---|
headlines_v0.2.patch | text/x-patch | 18.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Martin | 2008-06-01 00:57:59 | Re: Converting empty input strings to Nulls |
Previous Message | Jeff Davis | 2008-05-31 18:54:25 | Re: Converting empty input strings to Nulls |
From | Date | Subject | |
---|---|---|---|
Next Message | Jeff Davis | 2008-06-01 01:10:06 | synchronized scans for VACUUM |
Previous Message | Greg Sabino Mullane | 2008-05-31 23:32:48 | Re: Overhauling GUCS |