| From: | Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl> |
|---|---|
| To: | pgsql-general General <pgsql-general(at)postgresql(dot)org> |
| Subject: | Using tsearch2 in a Bayesian filter |
| Date: | 2008-04-06 11:13:18 |
| Message-ID: | A2D9C6F9-4394-4871-A882-3348D45CCBFC@solfertje.student.utwente.nl |
| Lists: | pgsql-general |
Hi all,
In my spare time I've started on a general-purpose Bayesian filter
based on the now built-in tsearch2 functionality. The ability to stem
words from a message into lexemes, the removal of stop words, and GiST
indexes look promising enough to attempt this. However, my experience
with tsearch is somewhat limited, so I have a few questions...
The messages entering the filter will be in different languages and
encodings. For example, I get a lot of Cyrillic spam these days, while
I get a lot of English messages and a few in Dutch. The spam
especially is likely to lie about its encoding. Some messages will be
plain text, but many will be HTML.
- Is it possible to stem words from that wide a variety of content?
- If so, what approach would be best?
- Do I need to strip out the HTML tags or can they serve as lexemes
themselves?
Next, to determine the probability of a lexeme being of a certain
classification (for example spam or not spam), I need to be able to
count the number of occurrences of that lexeme in a text. I can't
store a probability, as the numbers aren't fixed[*] (was hoping to
abuse score() here, but that's probably a no-op). I haven't found any
tsearch functions to determine the number of occurrences of each
lexeme in a text. Ideally I'd have a resultset with ( lexeme, number
of occurrences) tuples, so that I can use that directly in a query.
- How do I determine the number of occurrences of each lexeme in a text?
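As a rough illustration of the result shape described above (this is not tsearch itself), here is a minimal Python sketch. The function name `lexeme_counts`, the tiny stop-word set, and the whitespace tokenizer are all hypothetical stand-ins; real stemming into lexemes would be done by the text search configuration.

```python
from collections import Counter

def lexeme_counts(text, stop_words=frozenset({"the", "a", "of"})):
    """Return (lexeme, occurrences) tuples, most frequent first."""
    # Naive stand-in for tsearch parsing: lowercase, split on whitespace,
    # and strip a little trailing/leading punctuation. No stemming is done.
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    # Drop empty tokens and (hypothetical) stop words.
    lexemes = [t for t in tokens if t and t not in stop_words]
    return Counter(lexemes).most_common()
```

Each returned tuple corresponds to one row of the desired ( lexeme, number of occurrences ) resultset.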
Thanks for your time.
[*] As more messages enter the system, there will be more occurrences
of lexemes in messages and in classifications. If I start out with
one lexeme occurring once in a single message, the chance that lexeme
is in a message is 1. As soon as another message arrives not
containing that lexeme, the chance is 0.5. The number of messages and
the occurrences of lexemes in messages and classifications keep
changing, so I will need to store the numbers the probability was
based on (I might still decide to add a column with the probability
calculated from those numbers for speed, of course).
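The arithmetic in this footnote can be sketched as follows, assuming only raw counts are stored and the probability is derived on demand. The class `LexemeStats` and its method names are hypothetical and exist only for illustration; in the database these would be counter columns.

```python
from collections import defaultdict

class LexemeStats:
    """Keeps counts rather than probabilities, since the counts keep moving."""

    def __init__(self):
        self.messages = 0                    # total messages seen so far
        self.containing = defaultdict(int)   # lexeme -> messages containing it

    def add_message(self, lexemes):
        self.messages += 1
        # Count each lexeme at most once per message.
        for lexeme in set(lexemes):
            self.containing[lexeme] += 1

    def probability(self, lexeme):
        # Chance that a message contains the lexeme, from the current counts.
        if self.messages == 0:
            return 0.0
        return self.containing.get(lexeme, 0) / self.messages
```

With one message containing a lexeme, `probability` gives 1.0; after a second message without it, the same call gives 0.5, matching the example above.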
Regards,
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.