Re: Searching for substring with tsearch(1/2)

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Urmo <urmo(at)xwm(dot)ee>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Searching for substring with tsearch(1/2)
Date: 2003-12-09 21:07:49
Message-ID: 3FD639A5.9080902@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Urmo wrote:
> Hi,
>
> there seems to be no way of searching partial matches with tsearch.
> Would it be hard to implement prefix based matching, i.e.
> "hu" matches "human", "humanity", "humming", "huge"? With some hacking I
> managed to disable morphology part from tsearch1 (database contained
> multiple languages in a single table so morphology could not be used)
> and it run happily for a year. But now I needed prefix based substring
> match and I'm kinda lost. I tried using fulltextindex but it took ages
> to create the index with 100K table (query run about a day before I lost
> my pacience and canceled it). I even modified it to lose suffixes and
> index only full words but it was still too slow (50K records were
> indexed in 24h).
>
> Can anybody help or point out right direction? Or is this even (easily)
> doable with tsearch1 or tsearch2?

Tsearch was never minded as prefix search, and index structure doesn't support
any kind of prefix or suffix. But you can write extension to tsearch, which will
search by prefix. But such solution wiil not use index, only sequence scan.
:
Prefix searches easy realized with inverted index, but it require a lot of
programing.
The simplest way is:
create table invidx (
lexeme text not null primary key,
ids[] int
);

where ids[] - array with identificators of documents which contains this word.
So, your custom software may look as follow:
create function getids_by_word(text) returns setof int as ..........;

This function should returns all identificators of docs which contains word with
prefix in argument. So result query will be:
select * from docs where docs.id in (select * from getids_by_word());

Whats good:
1 efficience of word search
2 Reuse some code from tsearch :) - word parser

Whats bad:
1 update - too slow, some more efficient way is a bulk upload.
2 If word is frequent then query with 'IN (select * from func()) may works slow...

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2003-12-09 21:21:30 Re: Searching for substring with tsearch(1/2)
Previous Message Sailesh Krishnamurthy 2003-12-09 20:31:06 Re: PostgreSQL port to pure Java?