Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Date: 2007-11-09 12:28:38
Message-ID: 47345276.5060803@students.mimuw.edu.pl
Lists: pgsql-hackers pgsql-patches

> dictionaries. In this case, you would first check against one stopword
> list, eliminating 'od', then check the ispell dictionary, and then check
> another stopword list without 'od'.

My problem is basically solved by the patch I sent earlier. I use
'{stop, pl_ispell, simple}', which has the effect of:
a) eliminating stopwords that would otherwise be stemmed into
non-stopwords (such as 'od', which gets stemmed to 'oda')
b) stemming non-stopwords properly (using an ispell dictionary)
c) indexing words that are not recognized by ispell (for instance,
'postgresql' gets indexed as 'postgresql')
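
To illustrate the chain above, here is a rough Python sketch of how
'{stop, pl_ispell, simple}' behaves. It is only a toy model, not the
actual C code from the patch: the word lists, the function names and
the run_chain helper are all made up for the example, and the real
ispell and simple dictionaries do more than this.

```python
from typing import Callable, List, Optional

# A dictionary's lexize maps a word to one of three outcomes:
#   None      -> word not recognized, pass it on to the next dictionary
#   []        -> word is a stopword, reject it (nothing gets indexed)
#   [lexemes] -> word accepted, index these lexemes
Lexize = Callable[[str], Optional[List[str]]]

# Hypothetical data, just enough to reproduce the 'od' case.
STOPWORDS = {"od"}
TOY_ISPELL = {"od": ["oda"], "oda": ["oda"]}


def stop(word: str) -> Optional[List[str]]:
    """Stopword-only dictionary: rejects stopwords, passes on the rest."""
    return [] if word in STOPWORDS else None


def pl_ispell(word: str) -> Optional[List[str]]:
    """Toy stand-in for an ispell dictionary: stems known words only."""
    return TOY_ISPELL.get(word)


def simple(word: str) -> Optional[List[str]]:
    """Toy stand-in for 'simple': accepts any word, lowercased."""
    return [word.lower()]


def run_chain(word: str, chain: List[Lexize]) -> Optional[List[str]]:
    """Ask each dictionary in turn; the first non-None answer wins."""
    for lexize in chain:
        result = lexize(word)
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    chain = [stop, pl_ispell, simple]
    for w in ("od", "oda", "postgresql"):
        print(w, "->", run_chain(w, chain))
    # Without the leading stop dictionary, 'od' reaches ispell first.
    print("od (no stop) ->", run_chain("od", [pl_ispell, simple]))
```

The point of putting stop first is that 'od' is rejected before the
ispell stand-in ever gets a chance to turn it into 'oda'.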

> I suggested that a while ago
> (http://archives.postgresql.org/pgsql-hackers/2007-08/msg01036.php)
> Hopefully Oleg or someone else gets around to restructuring the
> dictionaries in a future release.

I'm glad to see I'm not the only one who needs a more
sophisticated way of dealing with dictionary chaining. I do understand,
however, the problems that arise when one wants to extend the dictionary
API beyond the reject/accept/pass-on scheme. For these three we have an
easy way of passing the result from lexize: it returns an empty array,
an array of stemmed lexemes, or NULL, respectively. If more complex
actions were to be taken, I'm afraid lexize would have to return
something more complex than just text[].
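
To make the limitation concrete, this is roughly how a caller can tell
the three outcomes apart when all it gets back is an array or NULL.
Again just a Python sketch with a made-up function name (interpret),
using None in place of SQL NULL:

```python
from typing import List, Optional


def interpret(result: Optional[List[str]]) -> str:
    """Classify a lexize result into the three actions it can signal."""
    if result is None:
        # NULL: the dictionary does not know the word.
        return "pass-on"
    if len(result) == 0:
        # Empty array: the word is a stopword.
        return "reject"
    # Non-empty array: these are the lexemes to index.
    return "accept"


if __name__ == "__main__":
    print(interpret(None))
    print(interpret([]))
    print(interpret(["oda"]))
```

There is no fourth value left over in this encoding, so any action
beyond these three would need a richer return type.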

> I wonder if you could hack the ispell dictionary file to treat oda
> specially?

I thought about it, but it turned out that writing a custom dictionary
was easier than figuring out how ispell works internally.

Regards,
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
