Re: tsearch2: enable non ascii stop words with C locale

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: tsearch2: enable non ascii stop words with C locale
Date:	2007-02-12 14:55:11
Message-ID:	45D07FCF.7020407@sigaev.ru
Lists:	pgsql-hackers

> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.

I'm not sure that the patch's description is correct.

First, the p_islatin() function is used only in the word/lexeme parser, not in
the stop-word code. Second, p_islatin() is used to catch lexemes such as URLs or
HTML entities, so it is important that it identifies real Latin characters. And
it works correctly: it calls p_isalpha() (already patched for your case), then
p_isascii(), which should be correct for any encoding with the C locale.
Third (and last):
contrib_regression=# show server_encoding;
 server_encoding
-----------------
 UTF8

contrib_regression=# show lc_ctype;
 lc_ctype
----------
 C

contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
 lexize
--------
 {}

Russian characters in UTF8 take two bytes each.
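
The two points above (a "latin" test built from an alphabetic check plus an ASCII check, and Cyrillic letters occupying two bytes in UTF8) can be illustrated with a small standalone sketch. This is not the tsearch2 source: byte_is_ascii(), byte_is_latin(), utf8_seq_len() and count_non_ascii() are hypothetical names, and the real p_isalpha() is locale- and encoding-aware, whereas this sketch only looks at single bytes with the C library's isalpha().

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: true if the byte is 7-bit ASCII. */
static int byte_is_ascii(unsigned char c)
{
    return c < 0x80;
}

/*
 * Hypothetical stand-in for a "latin" test of the shape described in
 * the message: alphabetic AND ASCII. Any byte of a multibyte UTF-8
 * sequence fails the ASCII half, whatever isalpha() says about it.
 */
static int byte_is_latin(unsigned char c)
{
    return (isalpha(c) && byte_is_ascii(c)) ? 1 : 0;
}

/*
 * Length in bytes of the UTF-8 sequence introduced by a lead byte,
 * or 0 if the byte cannot start a sequence (e.g. a continuation byte).
 */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/* Count the bytes in a NUL-terminated string that are outside ASCII. */
static size_t count_non_ascii(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (!byte_is_ascii((unsigned char) *s))
            n++;
    return n;
}
```

For example, the Cyrillic letter U+0438 is encoded in UTF8 as the bytes 0xD0 0xB8: utf8_seq_len(0xD0) reports a two-byte sequence, both bytes are counted by count_non_ascii(), and neither passes byte_is_latin(), while an ASCII letter such as 'a' does.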

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

In response to

tsearch2: enable non ascii stop words with C locale at 2007-02-11 08:20:38 from Tatsuo Ishii

Responses

Re: tsearch2: enable non ascii stop words with C locale at 2007-02-12 23:23:14 from Tatsuo Ishii
