From: | Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> |
---|---|
To: | teodor(at)sigaev(dot)ru |
Cc: | ishii(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: tsearch2: enable non ascii stop words with C locale |
Date: | 2007-02-12 23:23:14 |
Message-ID: | 20070213.082314.74752487.t-ishii@sraoss.co.jp |
Lists: | pgsql-hackers |
> > Currently tsearch2 does not accept non ascii stop words if locale is
> > C. Included patches should fix the problem. Patches against PostgreSQL
> > 8.2.3.
>
> I'm not sure about correctness of patch's description.
>
> First, p_islatin() function is used only in words/lexemes parser, not stop-word
> code.
I know. My guess is that the parser does not read the stop word file,
at least with the default configuration.
> Second, p_islatin() function is used for catching lexemes like URL or HTML
> entities, so, it's important to define real latin characters. And it works
> right: it calls p_isalpha (already patched for your case), then it calls
> p_isascii which should be correct for any encodings with C-locale.
The original p_islatin is defined as follows:
static int
p_islatin(TParser * prs)
{
    return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
}
So if a character is not ASCII, it returns 0 even if p_isalpha returns
1. Is this what you expect?
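To make the point concrete, here is a minimal standalone sketch of that logic. It is not the tsearch2 source: the sketch_* names and the codepoint type are hypothetical, a character is modeled as a single code point instead of going through TParser, and sketch_isalpha simply assumes that every non-ASCII code point counts as alphabetic.

```c
/* Hypothetical stand-ins for illustration only; not the tsearch2 code.
 * A character is modeled as one Unicode code point. */
typedef unsigned int codepoint;

/* True for code points in the 7-bit ASCII range. */
static int
sketch_isascii(codepoint c)
{
    return c < 0x80;
}

/* ASCII letters are alphabetic. ASSUMPTION: every non-ASCII code point is
 * treated as alphabetic, standing in for a p_isalpha that accepts
 * multibyte characters. */
static int
sketch_isalpha(codepoint c)
{
    if (sketch_isascii(c))
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    return 1;
}

/* Same shape as the quoted p_islatin: alphabetic AND ASCII. */
static int
sketch_islatin(codepoint c)
{
    return (sketch_isalpha(c) && sketch_isascii(c)) ? 1 : 0;
}
```

Under these assumptions sketch_islatin('a') is 1, while sketch_islatin(0x3042) (HIRAGANA LETTER A) is 0 even though sketch_isalpha(0x3042) is 1.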
> Third (and last):
> contrib_regression=# show server_encoding;
> server_encoding
> -----------------
> UTF8
> contrib_regression=# show lc_ctype;
> lc_ctype
> ----------
> C
> contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
> lexize
> --------
> {}
>
> Russian characters with UTF8 take two bytes.
In our case, we added JAPANESE_STOP_WORD to english.stop and then ran:
select to_tsvector(JAPANESE_STOP_WORD)
which returns the words even though they are listed in english.stop.
With the patches, the problem was solved.
--
Tatsuo Ishii
SRA OSS, Inc. Japan