english parser in text search: support for multiple words in the same position

From:	Sushant Sinha <sushant354(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	english parser in text search: support for multiple words in the same position
Date:	2010-08-01 18:04:36
Message-ID:	1280685876.1754.43.camel@dragflick
Lists:	pgsql-hackers

Currently the english parser in text search does not support multiple
words in the same position. Consider the word "wikipedia.org". Text
search returns a single token, "wikipedia.org", so a search for
"wikipedia org" finds no match. There are two problems here:

1. We do not have separate tokens "wikipedia" and "org".
2. If we had the two tokens, they would need to be at adjacent positions so
that a phrase search for "wikipedia org" works.

It would be nice to have the following tokenization and positioning for
"wikipedia.org":

position 0: WORD(wikipedia), URL(wikipedia.org)
position 1: WORD(org)
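
As a rough illustration of that layout (a Python sketch, not PostgreSQL
code; the function name host_tokens and the token-type labels are made up
for this example), the whole host can be emitted at the position of its
first component, with each dot-separated component taking consecutive
positions:

```python
def host_tokens(host, start=0):
    """Return (position, type, text) tuples for a host name.

    The whole host is emitted as a URL token at the position of its
    first component; each dot-separated component becomes a WORD token
    at consecutive positions.
    """
    tokens = [(start, "URL", host)]
    parts = [p for p in host.split(".") if p]  # skip empty components
    for offset, part in enumerate(parts):
        tokens.append((start + offset, "WORD", part))
    return tokens


for pos, kind, text in host_tokens("wikipedia.org"):
    print(pos, kind, text)
```

Here "wikipedia" and the full host share position 0 and "org" lands at
position 1, which is what a search for the two adjacent words would need.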

Take the example of "wikipedia.org/search?q=sushant"

Here is the TSVECTOR:

select to_tsvector('english', 'wikipedia.org/search?q=sushant');

to_tsvector
----------------------------------------------------------------------------
'/search?q=sushant':3 'wikipedia.org':2 'wikipedia.org/search?q=sushant':1

And here are the tokens:

select ts_debug('english', 'wikipedia.org/search?q=sushant');

ts_debug
--------------------------------------------------------------------------------
(url,URL,wikipedia.org/search?q=sushant,{simple},simple,{wikipedia.org/search?q=sushant})
(host,Host,wikipedia.org,{simple},simple,{wikipedia.org})
(url_path,"URL path",/search?q=sushant,{simple},simple,{/search?q=sushant})

The tokenization I would like to see is:

position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
position 1: WORD(org)
position 2: WORD(search), URL_PATH(/search?q=sushant)
position 3: WORD(q), URL_QUERY(q=sushant)
position 4: WORD(sushant)

So what we need is support for multiple tokens at the same position, and
I need help understanding how to realize this. Currently the position
assignment happens in make_tsvector by working on parsed lexemes. The
lexeme is obtained by prsd_nexttoken.

However, prsd_nexttoken returns only a single token. Would it be possible
to store some tokens and return them together? Or could we put a flag on
certain tokens that says the position should not be increased?
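
To make the flag idea concrete, here is a minimal Python sketch of the
position-assignment loop. It is not the actual make_tsvector logic: the
Token record, its same_pos flag, and the functions assign_positions and
is_adjacent are all hypothetical names for this illustration. A token
with the flag set reuses the previous token's position instead of
advancing the counter.

```python
from collections import namedtuple

# Hypothetical token record: same_pos is the proposed flag meaning
# "do not advance the position counter for this token".
Token = namedtuple("Token", ["kind", "text", "same_pos"])


def assign_positions(tokens, first=0):
    """Map each token text to the list of positions it occupies."""
    positions = {}
    pos = first
    for i, tok in enumerate(tokens):
        # The flag is ignored for the very first token, which has no
        # predecessor to share a position with.
        if i > 0 and not tok.same_pos:
            pos += 1
        positions.setdefault(tok.text, []).append(pos)
    return positions


def is_adjacent(positions, left, right):
    """True if some occurrence of `right` directly follows `left`."""
    right_pos = positions.get(right, [])
    return any(p + 1 in right_pos for p in positions.get(left, []))


# Token stream for 'wikipedia.org/search?q=sushant' in the desired layout.
stream = [
    Token("WORD", "wikipedia", False),
    Token("URL", "wikipedia.org/search?q=sushant", True),
    Token("WORD", "org", False),
    Token("WORD", "search", False),
    Token("URL_PATH", "/search?q=sushant", True),
    Token("WORD", "q", False),
    Token("URL_QUERY", "q=sushant", True),
    Token("WORD", "sushant", False),
]

positions = assign_positions(stream)
for text, where in positions.items():
    print(where, text)
```

With this layout, is_adjacent(positions, "wikipedia", "org") holds, so an
adjacency-based match for "wikipedia org" becomes possible. The
alternative of having the parser buffer and return several tokens at once
would reach the same positions but needs a different parser interface.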

-Sushant.

Responses

Re: english parser in text search: support for multiple words in the same position at 2010-08-02 07:36:24 from Markus Wanner