From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Joanna Sharman <Joanna(dot)Sharman(at)ed(dot)ac(dot)uk> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: HTML tags and tsearch2 |
Date: | 2008-06-26 12:05:09 |
Message-ID: | Pine.LNX.4.64.0806261602120.11363@sn.sai.msu.ru |
Lists: | pgsql-general |
On Thu, 26 Jun 2008, Joanna Sharman wrote:
> Hi,
>
> I have recently started experimenting with tsearch2 and it seems that the
> default behaviour is to ignore HTML tags and treat them as word-separators.
> What I would like it to do is to ignore HTML tags within words, but instead
> of creating separate words, combine the characters separated by the tag into
> one word.
>
> For example: in the database I have words like 'K<sub>ir</sub>' that need to
> be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML
> tags are ignored and two words are stored in the vector, 'k' and 'ir'. I
> would like only one word, 'kir', to be stored in the vector, so that searches
> using the word 'kir' will match the row.
Two options: write your own HTML-aware parser, or preprocess the text (strip the tags) before calling to_tsvector.
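
A minimal sketch of the preprocessing option, assuming the text is cleaned in Python before it is passed to to_tsvector. The function name `strip_tags_for_indexing` and the list of inline tags are hypothetical and not part of tsearch2; adjust them to your data. Inline tags are dropped without leaving a gap, so 'K<sub>ir</sub>' becomes 'Kir', while all other tags are replaced by a space so that neighbouring words are not glued together.

```python
import re

# Hypothetical set of tags that may occur inside a word; adjust to the data.
INLINE_TAGS = {"sub", "sup", "i", "b", "em", "strong", "span"}

# Matches an opening or closing tag and captures the tag name.
_TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_tags_for_indexing(html: str) -> str:
    """Remove inline tags entirely; replace every other tag with a space."""

    def _replace(match: re.Match) -> str:
        name = match.group(1).lower()
        return "" if name in INLINE_TAGS else " "

    return _TAG_RE.sub(_replace, html)


if __name__ == "__main__":
    print(strip_tags_for_indexing("K<sub>ir</sub> channel"))
```

The cleaned string can then be stored in a separate column, or passed straight to to_tsvector when the row is inserted or updated. A regular expression is not a full HTML parser, so this only suits simple, well-formed markup.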
>
> A second, related question is whether it is possible to cause tsearch2 to
> split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.
You can write your own dictionary or use dict_regex from
http://vo.astronet.ru/arxiv/dict_regex.html
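
If a custom dictionary is more than you need, the split could also be done as a preprocessing step before to_tsvector. This is a different technique from the dictionary approach, shown only as a rough sketch; the function name `split_letters_and_digits` is hypothetical, and it handles ASCII letters only.

```python
import re

# Zero-width match at every boundary between an ASCII letter and a digit.
_BOUNDARY_RE = re.compile(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])")


def split_letters_and_digits(text: str) -> str:
    """Insert a space wherever a letter is directly followed by a digit or vice versa."""
    return _BOUNDARY_RE.sub(" ", text)


if __name__ == "__main__":
    print(split_letters_and_digits("TM8 receptor"))
```

Note that this changes the stored text for every word containing digits, so the same function would have to be applied to the search terms as well.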
>
> I am not sure if this functionality is possible to implement using tsearch2
> or if there might be a better way, so I would be grateful for any advice or
> pointers to further reading on how I might do this. (I am using PostgreSQL
> version 8.1.10)
Think about upgrading to 8.3, where full-text search is integrated into the core.
>
> Many thanks in advance,
> Joanna
>
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83