From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Laimonas Simutis <laimis(at)gmail(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: processing urls with tsearch2 |
Date: | 2007-09-13 19:02:40 |
Message-ID: | Pine.LNX.4.64.0709132258040.2767@sn.sai.msu.ru |
Lists: | pgsql-general |
On Thu, 13 Sep 2007, Laimonas Simutis wrote:
> Hey guys,
>
> maybe anyone using tsearch2 could advise on this. With the default
> installation, url, host and some other tokens are processed with the simple
> dictionary. Thus term like mywebsite.com gets stored as 'mywebsite.com'. The
> parser correctly assigns token id of type host to the term, but then the
> dictionary the terms gets routed through is simple and what gets stored is
> mywebsite.com
>
> The questions are:
>
> 1) is there a dictionary available that I could utilize that will remove
> .com, .net, .org, etc? I could write one myself, but after seeing some
> sample dictionary implementations and C code I try to avoid, I got scared a
> bit.
Yes, we have dict_regex, developed by Sergey Karpov; for details see
http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html
It uses the PCRE library, so you need to know Perl regular expressions.
>
> 2) has anyone else dealt with this maybe in a different way?
Sure: preprocess the text in your preferred language before passing it to to_tsvector.
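
A minimal sketch of that preprocessing step in Python, assuming the goal is only to drop a trailing .com, .net or .org from host-like tokens. The suffix list, the pattern, and the function name strip_tlds are illustrative assumptions, not part of tsearch2:

```python
import re

# Hypothetical suffix list for illustration; extend it to whatever
# top-level domains should be dropped.
TLD_SUFFIXES = ("com", "net", "org")

# Matches host-like tokens such as "mywebsite.com" or "www.example.org"
# and captures everything before the final ".<suffix>".
HOST_RE = re.compile(
    r"\b((?:[a-z0-9-]+\.)*[a-z0-9-]+)\.(?:%s)\b" % "|".join(TLD_SUFFIXES),
    re.IGNORECASE,
)


def strip_tlds(text: str) -> str:
    """Return text with the trailing TLD removed from each host name."""
    return HOST_RE.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_tlds("Visit mywebsite.com or www.example.org today"))
```

The cleaned string would then be passed to to_tsvector as an ordinary query parameter, so the stored lexeme becomes 'mywebsite' rather than 'mywebsite.com'. Words that merely begin with a listed suffix after a dot (for example "foo.community") are left alone because the pattern requires a word boundary after the suffix.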
>
>
> Thanks for any suggestions and help,
>
> Laimis
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83