From: | jesper(at)krogh(dot)cc |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Configuring Text Search parser? |
Date: | 2010-09-20 14:01:08 |
Message-ID: | 1a26550c0b55c0a0af0dcbd8e080bc82.squirrel@shrek.krogh.cc |
Lists: | pgsql-hackers |
Hi.
I'm trying to migrate an application off an existing Full Text Search engine
and onto PostgreSQL. One of my main (remaining) headaches is the
fact that PostgreSQL treats _ as a separator character, whereas the existing
behaviour is to "not split". That means:
testdb=# select ts_debug('database_tag_number_999');
ts_debug
------------------------------------------------------------------------------
(asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
(blank,"Space symbols",_,{},,)
(asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
(blank,"Space symbols",_,{},,)
(asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
(blank,"Space symbols",_,{},,)
(uint,"Unsigned integer",999,{simple},simple,{999})
(7 rows)
The incoming data, by design, contains a set of tags that include _
and are each expected to be one "lexeme".
I've tried to work around this with the following patch.
$ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
*** src/backend/tsearch/wparser_def.c.orig 2010-09-20 15:58:37.033336460 +0200
--- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
***************
*** 967,986 ****
--- 967,988 ----
static const TParserStateActionItem actionTPS_InNumWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
{p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+ {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
{NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
};
static const TParserStateActionItem actionTPS_InAsciiWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
{p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+ {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
***************
*** 995,1004 ****
--- 997,1007 ----
static const TParserStateActionItem actionTPS_InWord[] = {
{p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
{p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
{p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
+ {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
{p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
{p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
{NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
};
This will obviously break other people's applications, so my question is:
if this should be made configurable, how should it be done?
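To make the question concrete, here is a toy sketch of what "configurable" could mean. It is not PostgreSQL code and does not use the parser's state tables: `ToyParserOptions`, `underscore_joins_words` and `toy_count_tokens` are hypothetical names invented for illustration, and the real parser has no such setting. The sketch only counts tokens, treating _ either as a separator or as a word character depending on the flag.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option block; nothing like this exists in wparser_def.c. */
typedef struct
{
    bool underscore_joins_words;    /* true: '_' is part of a word */
} ToyParserOptions;

/* A character belongs to a token if it is alphanumeric, or if it is
 * '_' and the (assumed) option says underscores should not split. */
static bool
is_word_char(unsigned char c, const ToyParserOptions *opts)
{
    if (isalnum(c))
        return true;
    return c == '_' && opts->underscore_joins_words;
}

/* Count the maximal runs of word characters in text. */
size_t
toy_count_tokens(const char *text, const ToyParserOptions *opts)
{
    size_t      count = 0;
    bool        in_token = false;

    assert(text != NULL && opts != NULL);

    for (const unsigned char *p = (const unsigned char *) text; *p; p++)
    {
        bool        word = is_word_char(*p, opts);

        if (word && !in_token)
            count++;            /* a new token starts here */
        in_token = word;
    }
    return count;
}
```

With the flag off, 'database_tag_number_999' yields four tokens, matching the four non-blank rows in the ts_debug output above. With the flag on, it yields one.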
As a side note: Xapian doesn't split on _, whereas Lucene does.
Thanks.
--
Jesper