| From: | Tatsuo Ishii <ishii(at)postgresql(dot)org> |
|---|---|
| To: | tgl(at)sss(dot)pgh(dot)pa(dot)us |
| Cc: | oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Latin vs non-Latin words in text search parsing |
| Date: | 2007-10-23 22:45:58 |
| Message-ID: | 20071024.074558.51700790.t-ishii@sraoss.co.jp |
| Lists: | pgsql-hackers |
Just to clarify: are you planning to make these changes during the 8.3
beta test period?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
> If I am reading the state machine in wparser_def.c correctly, the
> three classifications of words that the default parser knows are
>
> lword    Composed entirely of ASCII letters
> nlword   Composed entirely of non-ASCII letters
>          (where "letter" is defined by iswalpha())
> word     Entirely alphanumeric (per iswalnum()), but not above
>          cases
>
> This classification is probably sane enough for dealing with mixed
> Russian/English text --- IIUC, Russian words will come entirely from
> the Cyrillic alphabet which has no overlap with ASCII letters. But
> I'm thinking it'll be quite inconvenient for other European languages
> whose alphabets include the base ASCII letters plus other stuff such
> as accented letters. They will have a lot of words that fall into
> the catchall "word" category, which will mean they have to index
> mixed alpha-and-number words in order to catch all native words.
>
> ISTM that perhaps a more generally useful definition would be
>
> lword    Only ASCII letters
> nlword   Entirely letters per iswalpha(), but not lword
> word     Entirely alphanumeric per iswalnum(), but not nlword
>          (hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.
>
> Comments?
>
> regards, tom lane
>
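
The two classification schemes quoted above can be compared with a small sketch. It is written in Python rather than the C of wparser_def.c, and it is not the parser's actual state machine: str.isalpha() and str.isalnum() are used as stand-ins for iswalpha() and iswalnum(), and the function names classify_current and classify_proposed are hypothetical.

```python
from typing import Optional


def is_ascii_letter(ch: str) -> bool:
    """True for a single ASCII letter (a-z, A-Z)."""
    return ch.isascii() and ch.isalpha()


def classify_current(token: str) -> Optional[str]:
    """Classes as described for the existing default parser (assumed reading)."""
    if not token:
        return None
    if all(is_ascii_letter(c) for c in token):
        return "lword"   # composed entirely of ASCII letters
    if all(c.isalpha() and not c.isascii() for c in token):
        return "nlword"  # composed entirely of non-ASCII letters
    if all(c.isalnum() for c in token):
        return "word"    # alphanumeric catchall, including mixed-alphabet words
    return None          # not a word token under these three classes


def classify_proposed(token: str) -> Optional[str]:
    """Classes under the proposed definition."""
    if not token:
        return None
    if all(is_ascii_letter(c) for c in token):
        return "lword"   # only ASCII letters
    if all(c.isalpha() for c in token):
        return "nlword"  # entirely letters, but not lword
    if all(c.isalnum() for c in token):
        return "word"    # alphanumeric, so at least one non-letter such as a digit
    return None


if __name__ == "__main__":
    for tok in ["hello", "привет", "café", "abc123"]:
        print(tok, classify_current(tok), classify_proposed(tok))
```

Under these assumptions, an accented word such as "café" lands in the catchall "word" class with the current rules and in "nlword" with the proposed ones, while "abc123" is "word" in both.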