Latin vs non-Latin words in text search parsing

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Latin vs non-Latin words in text search parsing
Date: 2007-10-21 20:47:43
Message-ID: 29209.1192999663@sss.pgh.pa.us
Lists: pgsql-hackers

If I am reading the state machine in wparser_def.c correctly, the
three classifications of words that the default parser knows are

	lword	Composed entirely of ASCII letters
	nlword	Composed entirely of non-ASCII letters
		(where "letter" is defined by iswalpha())
	word	Entirely alphanumeric (per iswalnum()), but not the
		above cases
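As a sketch, the current three-way split could be written as a whole-token classifier like this (the names classify_current and TOK_* are illustrative only -- the real code in wparser_def.c is a character-driven state machine, and the non-ASCII behavior of iswalpha()/iswalnum() depends on the locale):

```c
#include <wchar.h>
#include <wctype.h>

enum token_class { TOK_LWORD, TOK_NLWORD, TOK_WORD, TOK_OTHER };

/* Current rules: lword = all ASCII letters, nlword = all non-ASCII
 * letters, word = all alphanumeric but neither of the above. */
static enum token_class
classify_current(const wchar_t *s)
{
	int			all_ascii_alpha = 1,
				all_nonascii_alpha = 1,
				all_alnum = 1;

	for (; *s; s++)
	{
		int			alpha = iswalpha((wint_t) *s);
		int			ascii = (*s < 128);

		if (!(alpha && ascii))
			all_ascii_alpha = 0;
		if (!(alpha && !ascii))
			all_nonascii_alpha = 0;
		if (!iswalnum((wint_t) *s))
			all_alnum = 0;
	}
	if (all_ascii_alpha)
		return TOK_LWORD;
	if (all_nonascii_alpha)
		return TOK_NLWORD;
	if (all_alnum)
		return TOK_WORD;
	return TOK_OTHER;
}
```

Note that an accented word mixing ASCII and non-ASCII letters fails both the lword and nlword tests here, so it falls through to the catchall "word" class along with digit-containing tokens -- which is exactly the inconvenience described below.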

This classification is probably sane enough for dealing with mixed
Russian/English text --- IIUC, Russian words will come entirely from
the Cyrillic alphabet which has no overlap with ASCII letters. But
I'm thinking it'll be quite inconvenient for other European languages
whose alphabets include the base ASCII letters plus other stuff such
as accented letters. They will have a lot of words that fall into
the catchall "word" category, which will mean they have to index
mixed alpha-and-number words in order to catch all native words.

ISTM that perhaps a more generally useful definition would be

	lword	Only ASCII letters
	nlword	Entirely letters per iswalpha(), but not lword
	word	Entirely alphanumeric per iswalnum(), but not nlword
		(hence, includes at least one digit)
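In the same sketch form (again with illustrative names, not PostgreSQL's actual code, and with locale-dependent results for non-ASCII input), the proposal changes only the second and third tests:

```c
#include <wchar.h>
#include <wctype.h>

enum token_class { TOK_LWORD, TOK_NLWORD, TOK_WORD, TOK_OTHER };

/* Proposed rules: lword = all ASCII letters; nlword = all letters with
 * at least one non-ASCII; word = all alphanumeric with at least one
 * digit. */
static enum token_class
classify_proposed(const wchar_t *s)
{
	int			all_ascii_alpha = 1,
				all_alpha = 1,
				all_alnum = 1;

	for (; *s; s++)
	{
		int			alpha = iswalpha((wint_t) *s);

		if (!(alpha && *s < 128))
			all_ascii_alpha = 0;
		if (!alpha)
			all_alpha = 0;
		if (!iswalnum((wint_t) *s))
			all_alnum = 0;
	}
	if (all_ascii_alpha)
		return TOK_LWORD;
	if (all_alpha)
		return TOK_NLWORD;		/* letters only, at least one non-ASCII */
	if (all_alnum)
		return TOK_WORD;		/* hence includes at least one digit */
	return TOK_OTHER;
}
```

Under this definition an all-letter accented word classifies as nlword instead of falling through to the catchall, so only genuinely mixed alpha-and-digit tokens land in "word".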

However, I am no linguist and maybe I'm missing something.

Comments?

regards, tom lane
