From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | andreas(at)visena(dot)com |
Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: pg full text search very slow for Chinese characters |
Date: | 2019-09-11 02:34:17 |
Message-ID: | 20190911.113417.69552735.horikyota.ntt@gmail.com |
Lists: | pgsql-general |
Hi.
At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <andreas(at)visena(dot)com> wrote in <VisenaEmail(dot)3(dot)8750116fce15432e(dot)16d1c0b2b28(at)tc7-visena>
> On Tuesday, 10 September 2019 at 18:21:45, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Jimmy Huang <jimmy_huang(at)live(dot)com> writes:
> > > I tried pg_trgm and my own customized token parser
> > > https://github.com/huangjimmy/pg_cjk_parser
> >
> > pg_trgm is going to be fairly useless for indexing text that's mostly
> > multibyte characters, since its unit of indexable data is just 3 bytes
> > (not characters). I don't know of any comparable issue in the core
> > tsvector logic, though. The numbers you're quoting do sound quite awful,
> > but I share Cory's suspicion that it's something about your setup rather
> > than an inherent Postgres issue.
> >
> > regards, tom lane
>
> We experienced quite awful performance when we hosted the
> DB on virtual servers (~5 years ago) and it turned out we hit the write-cache
> limit (then 8GB), which resulted in ~1MB/s IO thruput. Running iozone might
> help tracing down IO-problems.
>
> --
> Andreas Joseph Krogh
Multibyte characters also quickly bloat the index with a great many
small buckets, one for every 3-character combination drawn from
thousands of characters, which makes the index useless.
pg_bigm, which is based on bigrams (2-grams), works better on
multibyte characters.
https://pgbigm.osdn.jp/index_en.html
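As a rough illustration of the bucket-count argument, here is a minimal
Python sketch. It is not how pg_trgm or pg_bigm are implemented (it ignores
padding, hashing and case folding), and the alphabet size of 3000 is only an
assumed round number for commonly used Chinese characters. The helper names
ngrams and keyspace are hypothetical.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def keyspace(alphabet_size: int, n: int) -> int:
    """Upper bound on distinct n-grams over an alphabet of the given size."""
    return alphabet_size ** n


LATIN_ALPHABET = 26          # lowercase ASCII letters only
ASSUMED_CJK_ALPHABET = 3000  # assumption: illustrative, not a measured value

if __name__ == "__main__":
    for n in (2, 3):
        print(
            f"n={n}: latin={keyspace(LATIN_ALPHABET, n):,} "
            f"cjk={keyspace(ASSUMED_CJK_ALPHABET, n):,}"
        )

    # Arbitrary six-character sample string, used only to show the splitting.
    sample = "全文检索很慢"
    print(ngrams(sample, 2))
    print(ngrams(sample, 3))
```

Under these assumptions the possible trigram keys outnumber the possible
bigram keys by a factor equal to the alphabet size, so the same text is
spread over far more, far sparser buckets with n=3 than with n=2.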
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center