Re: contrib/tsearch

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)stack(dot)net>
Subject: Re: contrib/tsearch
Date: 2002-09-06 10:41:01
Message-ID: Pine.GSO.4.44.0209061312560.13637-100000@ra.sai.msu.su
Lists: pgsql-hackers

On Fri, 6 Sep 2002, Christopher Kings-Lynne wrote:

> There also seems to be a more complete list of english stopwords here:
>
> http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

Chris, I think we have to separate the stop word list from the tsearch package
and supply just some defaults. The reason is to let the user decide what is
a stop word, since different domains need different stop words.
This is how OpenFTS works.
Also, we probably need to let the user decide when to check for stop words:
before or after stemming. I'm waiting for Martin's fix for the English stemmer,
and we'll probably switch to the snowball stemmers, which are of better quality.
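
To illustrate why the order of the stop word check matters, here is a minimal
sketch in Python. It is not tsearch, OpenFTS or snowball code: toy_stem, the
word list and the stop word list are all hypothetical and exist only to show
that the two orders can give different results.

```python
# Toy illustration only: toy_stem and the word lists are hypothetical,
# not the tsearch, OpenFTS or snowball code.

def toy_stem(word):
    """Strip a few common English suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def filter_before_stemming(words, stop_words):
    """Drop stop words first, then stem what is left."""
    return [toy_stem(w) for w in words if w not in stop_words]

def filter_after_stemming(words, stop_words):
    """Stem first, then drop any stem that is a stop word."""
    stems = [toy_stem(w) for w in words]
    return [s for s in stems if s not in stop_words]

words = ["being", "searching"]
stop_words = {"be"}

before = filter_before_stemming(words, stop_words)  # "being" survives as "be"
after = filter_after_stemming(words, stop_words)    # "being" is dropped
```

With the check before stemming, the stop list has to contain every surface
form; with the check after stemming, it has to contain the stems instead.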

Damn, we wanted to do this and much more a bit later, because we're under
heavy pressure at work. We'll see if we can manage our plans.

We certainly need developers to help us with full text searching and
ltree (it has a chance to support XML). We also need to work
on adding concurrency support to GiST.

So I can't promise we'll work on tsearch right now, but we provide
makedict.pl so you can build a dictionary with a custom list of stop words.
Did you try it?

>
> However this list again does not include contractions. I can take this
> list, check it and submit it to you Oleg, but do you want me to add
> contractions?
>
> eg. wasn't, isn't, it's, etc.?

Hmm, our parser isn't smart enough to handle them as a single word, so
it won't help:

13:30:03[megera(at)amon]~/app/fts/test-suite>./testdict.pl -p
wasn't
lexeme:wasn:1:Latin word
lexeme:':12:Space symbols
lexeme:t:1:Latin word

But you could always add 'wasn', 'isn' ... and 't', 's' to your list of
stop words and be happy. Hmm, we could probably enhance our parser to
handle such words too.
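
A minimal sketch of that workaround in Python, assuming a parser that, like
ours, splits a contraction at the apostrophe. split_words and drop_stop_words
are hypothetical stand-ins, not the real tsearch parser, and the stop word
list is only an example.

```python
import re

# Hypothetical stand-in for the parser behaviour shown above: runs of
# Latin letters become words, everything else (including ') separates them.
def split_words(text):
    return re.findall(r"[a-z]+", text.lower())

# Contraction fragments added by hand, next to an ordinary stop word ("it").
stop_words = {"wasn", "isn", "t", "s", "it"}

def drop_stop_words(text):
    """Keep only the words that are not in the stop word list."""
    return [w for w in split_words(text) if w not in stop_words]

fragments = split_words("wasn't")          # the two halves of the contraction
kept = drop_stop_words("It wasn't indexed")
```

The drawback is that single letters such as 't' and 's' are then dropped
everywhere, not only when they come from a contraction.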

Anyway, most problems are just a question of time, which we don't have :-(

>
> Chris
>
> > -----Original Message-----
> > From: pgsql-hackers-owner(at)postgresql(dot)org
> > [mailto:pgsql-hackers-owner(at)postgresql(dot)org]On Behalf Of Christopher
> > Kings-Lynne
> > Sent: Friday, 6 September 2002 12:20 PM
> > To: Christopher Kings-Lynne; Oleg Bartunov
> > Cc: Hackers; martin_porter(at)softhome(dot)net
> > Subject: Re: [HACKERS] contrib/tsearch
> >
> >
> > > Looking at the list of stopwords you sent me, Oleg, there are only about 1
> > > out of the list of 120 stopwords that need to have all word forms added. I
> > > also don't think it'll be a maintenance problem. The reason I think this is
> > > because stopwords in general don't have different word forms.
> >
> > Actually, it just occurred to me that stuff like:
> >
> > will
> > won't
> > it
> > it's
> > where
> > where's
> >
> > Will all have to be in the list, right?
> >
> > Chris
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 3: if posting/reading through Usenet, please send an appropriate
> > subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> > message can get through to the mailing list cleanly
> >
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
