Re: pg_trgm

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, ishii(at)sraoss(dot)co(dot)jp, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, pgsql-hackers(at)postgresql(dot)org, teodor(at)sigaev(dot)ru
Subject:	Re: pg_trgm
Date:	2010-05-27 19:00:22
Message-ID:	AANLkTimowXqtPBQl4Qhsj2LlxLhIbvMQUuI4G8cB42Eh@mail.gmail.com
Lists:	pgsql-hackers

On Thu, May 27, 2010 at 2:01 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
>> > I don't know about Japanese, but the locale approach works just fine for
>> > other agglutinative languages. I would rather suspect that it is the
>> > trigram approach that might be rather useless for such languages,
>> > because you are going to get a lot of similarity hits for the affixes.
>>
>> I'm not sure what you mean by "affixes". But I will explain...
>>
>> A Japanese sentence consists of words. The problem is that the words are
>> not separated by spaces (agglutinative). So most text tools, such as text
>> search, need a preprocessing step that finds word boundaries by looking up
>> dictionaries (and a smart grammar analysis routine). In the process,
>> "affixes" can be determined and perhaps removed from the target word
>> group to be used for text search (note that removing affixes is not
>> relevant to locale). Once we get a space-separated sentence, it can be
>> processed by text search or by pg_trgm just the same as English. (Note
>> that this preprocessing is done outside the PostgreSQL world.) The
>> only difference is that a "word" can consist of non-ASCII letters.
>
> I think the problem at hand has nothing at all to do with agglutination
> or CJK-specific issues. You will get the same problem with other
> languages *if* you set a locale that does not adequately support the
> characters in use. E.g., Russian with locale C and encoding UTF8:
>
> select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E\u043D\u044B');
> similarity
> ────────────
> NaN
> (1 row)
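
One way the NaN in the quoted example could arise (an assumption for illustration, not something stated in this thread): if trigram extraction only counts characters that the locale classifies as alphanumeric, then under locale C every Cyrillic letter is skipped, both trigram sets come out empty, and the similarity ratio becomes 0/0. A minimal Python sketch of that idea follows. The names `trigrams`, `similarity`, `c_locale_word_char`, and `unicode_word_char` are hypothetical, and the padding and ratio only approximate what pg_trgm does.

```python
import math


def trigrams(text, is_word_char):
    """Collect the trigrams of every word in text.

    A "word" is a maximal run of characters accepted by is_word_char.
    Each word is padded with two leading spaces and one trailing space
    (assumed here to mirror pg_trgm's padding).
    """
    grams = set()
    word = []
    # The extra trailing space flushes the last word.
    for ch in text.lower() + " ":
        if is_word_char(ch):
            word.append(ch)
        elif word:
            padded = "  " + "".join(word) + " "
            grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
            word = []
    return grams


def similarity(a, b, is_word_char):
    """Shared trigrams divided by distinct trigrams in either string."""
    ta = trigrams(a, is_word_char)
    tb = trigrams(b, is_word_char)
    union = len(ta | tb)
    # With no trigrams at all the ratio is 0/0; return NaN instead of raising.
    return len(ta & tb) / union if union else math.nan


def c_locale_word_char(ch):
    # Assumption: locale C only recognises ASCII letters and digits.
    return ch.isascii() and ch.isalnum()


def unicode_word_char(ch):
    # Stand-in for a locale that understands the characters in use.
    return ch.isalnum()


a = "\u0441\u043B\u043E\u043D"
b = "\u0441\u043B\u043E\u043D\u044B"
print(similarity(a, b, c_locale_word_char))   # no word characters recognised
print(similarity(a, b, unicode_word_char))    # trigrams are actually compared
```

With the ASCII-only classifier neither string yields any trigrams, so the first call falls into the 0/0 case. With the Unicode-aware classifier the two words share most of their trigrams and the result is an ordinary number between 0 and 1.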

What I can't help wondering as I'm reading this discussion is -
Tatsuo-san said upthread that he has a problem with pg_trgm that he
does not have with full text search. So what is full text search
doing differently than pg_trgm?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

In response to

Re: pg_trgm at 2010-05-27 18:01:01 from Peter Eisentraut

Responses

Re: pg_trgm at 2010-05-27 23:54:59 from Tatsuo Ishii
