From: | Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> |
---|---|
To: | tgl(at)sss(dot)pgh(dot)pa(dot)us |
Cc: | ishii(at)sraoss(dot)co(dot)jp, euler(at)timbira(dot)com, teodor(at)sigaev(dot)ru, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [Fwd: Re: tsearch in core patch] |
Date: | 2007-06-25 04:40:59 |
Message-ID: | 20070625.134059.26277531.t-ishii@sraoss.co.jp |
Lists: | pgsql-hackers |
> Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> writes:
> > Ok, probably we need to copy the English stemming rule to the one for
> > Japanese.
>
> Pardon my ignorance here, but is the concept of stemming even relevant
> to Japanese/Chinese/Korean? What little I know about ideographic
> languages suggests it wouldn't work well. And surely the specific rules
> in the Snowball project's English stemmer wouldn't work.
Your understanding is correct. The English stemmer would not work for
the "non-English" part of Japanese text.
What I meant was the "chunks of English text" embedded in Japanese.
> > I think same thing (commonly used English with local
> > language) can be applied to Chinese and Korean.
>
> Well, it's not hard at all to find chunks of English text that have
> embedded bits of French, Spanish, or what-have-you, but that's not an
> argument for trying to intermix the stemmers. I doubt that such simple
> bits of program could tell the language difference well enough to
> determine which stemming rules to apply.
For Japanese, it will be fairly simple: words in the 7-bit ASCII range
must be English (note that the most commonly used Japanese encodings,
such as EUC, do not allow mixing with ISO 8859).
--
Tatsuo Ishii
SRA OSS, Inc. Japan