From: | Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> |
---|---|
To: | pgsql-patches(at)postgresql(dot)org |
Subject: | a tsearch2 (8.2.4) dictionary that only filters out stopwords |
Date: | 2007-11-09 01:22:34 |
Message-ID: | 4733B65A.9030707@students.mimuw.edu.pl |
Lists: | pgsql-hackers pgsql-patches |
Hi,
the rationale for this patch is rather complicated, as it's related to
the peculiarities of Polish grammar. Please read on.
I'm using PostgreSQL 8.2.4 and the ispell tsearch2 dictionary. The
problem is as follows. In Polish (and possibly other languages that
don't come to my mind at the moment) a noun can take different forms
depending on the grammatical context. This is called declension. For
example, the noun 'oda' (which means 'ode' in English) can take the form
'od' in certain cases. However, the Polish word 'od' is also a
preposition. The problem with the ispell dictionary is that it first
reduces a lexeme to its stem and only then checks whether the stem is a
stopword.
This means that I either have to accept that the tsvectors
for my documents will contain large numbers of the noun 'oda' (because
each time the preposition 'od' is used in the text it will be stemmed to
produce 'oda' and then indexed) or I have to include the word 'oda' in
the stopwords file and thus eliminate a perfectly good noun from my
tsvectors.
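To make the dilemma concrete, here is a toy Python model of a
dictionary that stems first and filters stopwords afterwards. It is only
a sketch of the behaviour described above, not tsearch2 or ispell code;
the STEMS table and the function names are invented for the example.

```python
# Hypothetical affix table: every listed form reduces to the stem 'oda'.
STEMS = {"oda": "oda", "ody": "oda", "od": "oda"}

def stem_then_filter(token, stopwords):
    """Stem the token first, then test the *stem* against the stopword list."""
    stem = STEMS.get(token)
    if stem is None:
        return None      # word not known to this dictionary
    if stem in stopwords:
        return []        # stopword: discarded
    return [stem]        # lexeme that ends up in the tsvector

def index(tokens, stopwords):
    """Collect the lexemes produced for a list of tokens."""
    lexemes = []
    for token in tokens:
        lexemes.extend(stem_then_filter(token, stopwords) or [])
    return lexemes

# One genuine noun followed by three uses of the preposition 'od'.
text = ["oda", "od", "od", "od"]

print(index(text, set()))     # every 'od' is indexed as 'oda'
print(index(text, {"od"}))    # no help: only the stem is checked
print(index(text, {"oda"}))   # the genuine noun is lost as well
```

In this model, listing 'od' as a stopword changes nothing because the
check only ever sees the stem, and listing 'oda' removes the noun along
with the prepositions.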
The solution I came up with was simple: write a dictionary that does
only one thing: it looks up the lexeme in a stopwords file and either
discards it or returns NULL. That way I could use it as the first
dictionary in the dictionary stack for the lexeme types I'm interested
in, and it would discard every instance of 'od' while passing every
non-stopword (in particular 'oda') on to the ispell dictionary.
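The idea can be sketched as a toy Python model of a dictionary stack.
This is an illustration only, not the attached C patch: the names
stop_sieve, toy_ispell and lexize, and the word lists, are made up for
the example. It assumes the usual convention that a dictionary returns
an empty result to discard a stopword and NULL (None here) to hand the
token to the next dictionary in the stack.

```python
STOPWORDS = {"od"}                                   # hypothetical stopword file
STEMS = {"oda": "oda", "ody": "oda", "od": "oda"}    # hypothetical affix table

def stop_sieve(token):
    """Stopword-only dictionary: discard stopwords, pass everything else on."""
    return [] if token in STOPWORDS else None

def toy_ispell(token):
    """Stand-in for the ispell dictionary: reduce a known form to its stem."""
    stem = STEMS.get(token)
    return None if stem is None else [stem]

def lexize(token, stack):
    """Ask each dictionary in order; the first non-None answer wins."""
    for dictionary in stack:
        result = dictionary(token)
        if result is not None:
            return result
    return []    # no dictionary recognized the token, so it is dropped

stack = [stop_sieve, toy_ispell]

for token in ["od", "oda", "ody"]:
    print(token, "->", lexize(token, stack))
```

Because the sieve sees the raw token before any stemming happens, the
preposition 'od' never reaches the stemmer, while 'oda' and its other
forms are still reduced and indexed.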
The attached patch adds a dictionary called stop to the set of standard
dictionaries that one gets after installing tsearch2. The C code may not
be first-class (however, it works for me in a real business solution);
it's quite trivial and I'd be happy if some more experienced Postgres
hackers would implement the idea in a cleaner/safer way. It's been
tested on 8.2.4 and compiles on 8.2.5. I haven't even looked at the code
for 8.3 yet, but maybe the change could somehow make its way into the
integrated full text search?
Regards,
Jan Urbanski
Warsaw University
http://fiok.pl/
--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
Attachment | Content-Type | Size |
---|---|---|
tsearch-stopsieve.patch | text/plain | 3.1 KB |