From: | Reece Hart <reece(at)harts(dot)net> |
---|---|
To: | pgsql-general <pgsql-general(at)postgresql(dot)org> |
Cc: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: tsearch2 and hyphenated terms |
Date: | 2008-04-12 00:31:15 |
Message-ID: | 1207960275.7053.86.camel@snafu |
Lists: | pgsql-general |
On Fri, 2008-04-11 at 22:07 +0400, Oleg Bartunov wrote:
> We have the same problem with names in astronomy, so we implemented
> dict_regex http://vo.astronet.ru/arxiv/dict_regex.html
> Check it out !
Oleg-
This gets me a lot closer. Thank you. I have two remaining problems.
The first problem is that 'bcl-w' and 'bcl-2' are parsed differently,
like so:
unison@u8.3=> select * from ts_debug('english','bcl-w');
      alias      |           description           | token |  dictionaries  |  dictionary  | lexemes
-----------------+---------------------------------+-------+----------------+--------------+---------
 asciihword      | Hyphenated word, all ASCII      | bcl-w | {english_stem} | english_stem | {bcl-w}
 hword_asciipart | Hyphenated word part, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 blank           | Space symbols                   | -     | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | w     | {english_stem} | english_stem | {w}
unison@u8.3=> select * from ts_debug('english','bcl-2');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | bcl   | {english_stem} | english_stem | {bcl}
 int       | Signed integer  | -2    | {simple}       | simple       | {-2}
One option would be to write a new parser, or to modify wparser_def.c, so
that the InHyphenWordFirst state accepts p_isdigit or p_isalnum on the first
character (I think I got this right). This would match Tom's initial
inkling that Bcl-2 might be parsed as a numhword, and (to me) it seems
more congruent with the asciihword class.
Perhaps a more broadly useful modification is for the lexer to also emit
whitespace-delimited tokens (period). asciihword almost does the trick,
but it too requires a post-hyphen alphabetic character.
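
The two proposals above can be sketched in Python. This is only a toy model: the two patterns and both function names are hypothetical, they merely reproduce the bcl-w / bcl-2 outcomes shown in the ts_debug output, and they are not the logic in wparser_def.c.

```python
import re

# Toy stand-ins for the parser's hyphenated-word rule (hypothetical patterns).
# Observed behaviour: the character right after the hyphen must be a letter.
HWORD_OBSERVED = re.compile(r"[A-Za-z]+(?:-[A-Za-z][A-Za-z0-9]*)+")
# Proposed behaviour: the character right after the hyphen may be a digit too.
HWORD_PROPOSED = re.compile(r"[A-Za-z]+(?:-[A-Za-z0-9]+)+")


def is_hyphenated_word(token, allow_digit_after_hyphen=False):
    """Return True if the whole token counts as one hyphenated word."""
    pattern = HWORD_PROPOSED if allow_digit_after_hyphen else HWORD_OBSERVED
    return pattern.fullmatch(token) is not None


def whitespace_tokens(text):
    """The alternative proposal: also emit plain whitespace-delimited tokens."""
    return text.split()


if __name__ == "__main__":
    for term in ("bcl-w", "bcl-2"):
        print(term,
              is_hyphenated_word(term),
              is_hyphenated_word(term, allow_digit_after_hyphen=True))
    print(whitespace_tokens("bcl-2 and bcl-w"))
```

With the relaxed rule, bcl-2 would be kept as a single token in the same way bcl-w already is; the whitespace variant sidesteps the character classes entirely.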
The second problem is with quantifiers in PCRE regexps. I initially
implemented a dict_regex with a conf line like
(\w+)-(\w{1,2}) $1$2
That line does not work, although I can make simpler expressions work
(e.g., (bcl)-(\w)). I think it must be related to the README caveat
regarding PCRE partial matching mode, which I didn't understand initially.
However, I don't see that it's possible to write a general regexp like
the one I initially tried. Do you have any suggestions?
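
For reference, the rewrite that conf line is meant to express can be sketched with Python's re module. This is not PCRE and not dict_regex, so it says nothing about the partial-matching limitation; the function name normalize is hypothetical, and applying the rule to the whole token (fullmatch) is an assumption.

```python
import re

# Same pattern as the conf line; Python's re stands in for PCRE here.
RULE = re.compile(r"(\w+)-(\w{1,2})")


def normalize(term):
    """Join the two captured groups, i.e. the $1$2 replacement."""
    match = RULE.fullmatch(term)  # assumption: the rule must cover the whole token
    if match is None:
        return term  # leave anything else untouched
    return match.group(1) + match.group(2)


if __name__ == "__main__":
    for term in ("bcl-2", "bcl-w", "bcl-xl", "postgresql"):
        print(term, "->", normalize(term))
```

Under these assumptions bcl-2 and bcl-w become bcl2 and bclw, which is the general normalization the quantified pattern was intended to provide.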
Thanks again. I'm very impressed with tsearch2.
-Reece
--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0