From: | Hannu Krosing <hannu(at)tm(dot)ee> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Urmo <urmo(at)xwm(dot)ee>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Searching for substring with tsearch(1/2) |
Date: | 2003-12-09 22:42:51 |
Message-ID: | 1071009771.3194.34.camel@fuji.krosing.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Teodor Sigaev kirjutas T, 09.12.2003 kell 23:07:
> Urmo wrote:
> > Hi,
> >
> > there seems to be no way of searching partial matches with tsearch.
> > Would it be hard to implement prefix based matching, i.e.
> > "hu" matches "human", "humanity", "humming", "huge"? With some hacking I
> > managed to disable morphology part from tsearch1 (database contained
> > multiple languages in a single table so morphology could not be used)
> > and it run happily for a year. But now I needed prefix based substring
> > match and I'm kinda lost. I tried using fulltextindex but it took ages
> > to create the index with 100K table (query run about a day before I lost
> > my pacience and canceled it). I even modified it to lose suffixes and
> > index only full words but it was still too slow (50K records were
> > indexed in 24h).
> >
> > Can anybody help or point out right direction? Or is this even (easily)
> > doable with tsearch1 or tsearch2?
I have done it outside PostgreSQL (using BSDDB 1.85 and python, about
7-8 years ago) in a manner very much like Teodor describes below.
It was the original web search system for our leading newspaper, Eesti
Päevaleht. It worked without any maintenance for 3-4 years, even after
trashing parts of index due to other processes filled up the disk a few
times, after which it did not always return all the older results ;)
bulk updates for a days worth of articles (50-70) took just a few
minutes, search was about one second including starting up python cgi,
which usually took most of that second .
> Tsearch was never minded as prefix search, and index structure doesn't support
> any kind of prefix or suffix. But you can write extension to tsearch, which will
> search by prefix. But such solution wiil not use index, only sequence scan.
How efficient would tsearch be for really big expressions (where 'hu%'
would be expanded (using a btree word index on one column word table) to
tsearch equivalent of ( "human" or "humanity" or "humming" or "huge" or
..1000 words here...) before passing the expression to tsearch?
> :
> Prefix searches easy realized with inverted index, but it require a lot of
> programing.
> The simplest way is:
> create table invidx (
> lexeme text not null primary key,
> ids[] int
> );
>
> where ids[] - array with identificators of documents which contains this word.
How hard (or sensible ;) would be creating such an index using GiST ?
As proved by tsearch GiST can cope well with many-to-many indexes.
> So, your custom software may look as follow:
> create function getids_by_word(text) returns setof int as ..........;
>
> This function should returns all identificators of docs which contains word with
> prefix in argument. So result query will be:
> select * from docs where docs.id in (select * from getids_by_word());
>
> Whats good:
> 1 efficience of word search
> 2 Reuse some code from tsearch :) - word parser
>
> Whats bad:
> 1 update - too slow, some more efficient way is a bulk upload.
or some hybrid of bulk and update, perhaps with table structure like
create table invidx (
lexeme text not null,
textdate date not null,
ids[] int,
primary key (lexeme, textdate)
);
which would partition the invidx table on textdate (or some other
suitable datum)
> 2 If word is frequent then query with 'IN (select * from func()) may works slow...
if it is often too slow then creating a temp table and doing a plain
join may be faster.
-------------
Hannu
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Treat | 2003-12-09 22:43:49 | Re: PostgreSQL port to pure Java? |
Previous Message | Larry Rosenman | 2003-12-09 22:20:09 | Re: [COMMITTERS] pgsql-server/ oc/src/sgml/Tag: |