From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: tsearch2: enable non ascii stop words with C locale |
Date: | 2007-02-12 14:55:11 |
Message-ID: | 45D07FCF.7020407@sigaev.ru |
Lists: | pgsql-hackers |
> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.
I'm not sure the patch's description is correct.
First, the p_islatin() function is used only in the word/lexeme parser, not in the
stop-word code. Second, p_islatin() is used for catching lexemes such as URLs or HTML
entities, so it is important that it identifies real Latin characters. And it works
correctly: it calls p_isalpha (already patched for your case), then it calls
p_isascii, which should be correct for any encoding with the C locale.
Third (and last):
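The check described above can be sketched roughly as follows. This is not the tsearch2 source: the `sketch_*` names are hypothetical, the functions work on single bytes for simplicity, and the behaviour of the patched p_isalpha (accepting high-bit bytes under the C locale) is an assumption made only for illustration.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for the patched p_isalpha.
 * Assumption: under the C locale it also accepts bytes with the high bit set. */
static bool sketch_isalpha(unsigned char c)
{
    return c >= 0x80 || isalpha(c) != 0;
}

/* Hypothetical stand-in for p_isascii: true only for 7-bit bytes. */
static bool sketch_isascii(unsigned char c)
{
    return c < 0x80;
}

/* Hypothetical stand-in for p_islatin: alphabetic first, then ASCII. */
static bool sketch_islatin(unsigned char c)
{
    return sketch_isalpha(c) && sketch_isascii(c);
}
```

Under these assumptions a byte such as 0xD0 (the lead byte of many Cyrillic letters in UTF8) passes the alpha check but fails the ASCII check, so it is not treated as Latin.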
contrib_regression=# show server_encoding;
 server_encoding
-----------------
 UTF8

contrib_regression=# show lc_ctype;
 lc_ctype
----------
 C

contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
 lexize
--------
 {}
Russian characters in UTF8 take two bytes each, and the stop word is still recognized (the empty result).
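To illustrate the byte width mentioned above, here is a minimal sketch in C. The helper names are hypothetical, and the sample bytes are the UTF8 encodings of the Cyrillic letters U+0438 and U+0442, chosen only as an example.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of a UTF-8 sequence, judged from its lead byte.
 * Returns 0 for a continuation byte or an invalid lead byte. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Count characters by counting every byte that is not a continuation byte. */
static size_t utf8_count_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the two-letter string "\xD0\xB8\xD1\x82", strlen() reports four bytes while utf8_count_chars() reports two characters.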
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/