From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Heikki Linnakangas <heikki(at)enterprisedb(dot)com> |
Cc: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: integrated tsearch doesn't work with non utf8 database |
Date: | 2007-09-08 06:19:38 |
Message-ID: | Pine.LNX.4.64.0709081017260.2767@sn.sai.msu.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, 7 Sep 2007, Heikki Linnakangas wrote:
> Pavel Stehule wrote:
>> postgres=# select ts_debug('cs','PЪЪЪЪliЪЪ ЪЪluЪЪouЪЪkЪЪ kЪЪЪЪ se napil ЪЪlutЪЪ vody');
>> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
>> CONTEXT: SQL function "ts_debug" statement 1
>
> I can reproduce that. In fact, you don't need the custom config or
> dictionary at all:
>
> postgres=# CREATE DATABASE latin2 encoding='latin2';
> CREATE DATABASE
> postgres=# \c latin2
> You are now connected to database "latin2".
> latin2=# select ts_debug('simple','foo');
> ERROR: character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT: SQL function "ts_debug" statement 1
>
> It fails trying to lexize the string using the danish snowball stemmer,
> because the danish stopword file contains character 'ЪЪ' which doesn't
> have an equivalent in LATIN2.
>
> Now what the heck is it doing with the danish stemmer, you might ask.
> ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
> behind it, I get this plan:
>
> latin2=# \i foo.sql
> QUERY PLAN
>
> -----------------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=2.80..1134.45 rows=80 width=100)
> Hash Cond: (parse.tokid = tt.tokid)
> InitPlan
> -> Seq Scan on pg_ts_config (cost=0.00..1.20 rows=1 width=4)
> Filter: (oid = 3748::oid)
> -> Seq Scan on pg_ts_config (cost=0.00..1.20 rows=1 width=4)
> Filter: (oid = 3748::oid)
> -> Function Scan on ts_parse parse (cost=0.00..12.50 rows=1000
> width=36)
> -> Hash (cost=0.20..0.20 rows=16 width=68)
> -> Function Scan on ts_token_type tt (cost=0.00..0.20 rows=16
> width=68)
> SubPlan
> -> Limit (cost=7.33..7.36 rows=1 width=36)
> -> Subquery Scan dl (cost=7.33..7.36 rows=1 width=36)
> -> Sort (cost=7.33..7.34 rows=1 width=8)
> Sort Key: m.mapseqno
> -> Seq Scan on pg_ts_config_map m
> (cost=0.00..7.32 rows=1 width=8)
> Filter: ((ts_lexize(mapdict, $1) IS NOT
> NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
> -> Sort (cost=6.57..6.57 rows=1 width=8)
> Sort Key: m.mapseqno
> -> Seq Scan on pg_ts_config_map m (cost=0.00..6.56 rows=1
> width=8)
> Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
> (21 rows)
>
> Note the Seq Scan on pg_ts_config_map, with filter on ts_lexize(mapdict,
> $1). That means that it will call ts_lexize on every dictionary, which
> will try to load every dictionary. And loading danish_stem dictionary
> fails in latin2 encoding, because of the problem with the stopword file.
>
> We could rewrite ts_debug as a C-function, so that it doesn't try to
ts_debug currently doesn't work well with thesaurus dictionary, so it
certainly needs to be rewritten in C. We left rewriting it for future.
> access any unnecessary dictionaries. It seems wrong to install
> dictionaries in databases where they won't work in the first place, but
> I don't see an easy fix for that. Any comments or better ideas?
>
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
From | Date | Subject | |
---|---|---|---|
Next Message | apoc9009 | 2007-09-08 08:39:19 | Re: [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC) |
Previous Message | Tom Lane | 2007-09-08 03:34:32 | Re: apparent tsearch breakage on 64-bit machines |