Re: TSearch2 / Get all unique lexems

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Hannes Dorbath <light(at)theendofthetunnel(dot)de>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, pgsql-general(at)postgresql(dot)org
Subject:	Re: TSearch2 / Get all unique lexems
Date:	2005-12-08 11:00:55
Message-ID:	43981267.6090802@sigaev.ru
Lists:	pgsql-general

> Thanks. I hoped for something possible inside a pl/pgsql proc. I'm
> trying to integrate pg_trgm with Tsearch2. I'm still on my UTF-8
> database. Yes I know, there is _NO_ UTF-8 support of any kind in
> Tsearch2 yet, but I got it working to a degree that is OK for my
> application (Created my own stemmer variant, ispell dict, affix file
> etc). The last missing bit is to get a source for pg_trgm. I cannot use
> the stat() function, because it breaks as soon as it sees a UTF-8 char.

I suppose that a word parser which is not compatible with UTF can produce
illegal lexemes (containing part of a multibyte char) and store them in the
tsvector. Tsvector has no check for broken lexemes (for example, with a
pg_verifymbstr() call), but stat() builds a text field, and then postgres checks
it and finds incomplete multibyte chars. The ways I see (other than waiting for
the UTF support in tsearch2, which we are developing now):

1 Modify the stat() function to check the text field and, if the check fails,
remove the lexeme from the output.

2 Take the word parser from CVS HEAD (ts_locale.[ch], wparser_def.c,
wordparser/parser.[ch]). to_tsvector will work fine; to_tsquery will work
correctly only with quoted strings (for example, 'foo' & 'bar'; bad: foo & bar).
But casting 'asasas'::tsvector and dump/reload will not work correctly.
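
The idea behind option 1 can be sketched outside of postgres. This is only an
illustration in Python, not tsearch2 code: is_valid_utf8() and
drop_broken_lexemes() are hypothetical names, and the strict decode stands in
for the kind of check that pg_verifymbstr() performs. It assumes the lexemes
are available as raw byte strings.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is a complete, well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")  # strict decoding rejects truncated multibyte chars
    except UnicodeDecodeError:
        return False
    return True


def drop_broken_lexemes(lexemes):
    """Keep only lexemes that pass the encoding check (hypothetical helper)."""
    return [lx for lx in lexemes if is_valid_utf8(lx)]


# A byte-oriented parser may cut "für" inside the two-byte "ü" (0xC3 0xBC),
# which leaves a lexeme ending in a lone lead byte.
whole = "für".encode("utf-8")   # b'f\xc3\xbcr'
broken = whole[:2]              # b'f\xc3', an incomplete multibyte char

print(drop_broken_lexemes([b"suche", broken, whole]))
```

Only the broken lexeme is dropped; lexemes that are valid UTF-8 pass through
unchanged, so the output stays usable as a source for pg_trgm.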

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

In response to

Re: TSearch2 / Get all unique lexems at 2005-12-08 08:50:28 from Hannes Dorbath
