From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Rick Jansen <rick(at)rockingstone(dot)nl> |
Cc: | Mike Rylander <mrylander(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Tsearch2 performance on big database |
Date: | 2005-03-23 09:40:03 |
Message-ID: | Pine.GSO.4.62.0503231236390.5508@ra.sai.msu.su |
Lists: | pgsql-performance |
On Wed, 23 Mar 2005, Rick Jansen wrote:
> Oleg Bartunov wrote:
>> On Tue, 22 Mar 2005, Rick Jansen wrote:
>>
>> Hmm, the default configuration is too eager: you index every lexeme using the
>> simple dictionary! That's probably too much. Here is what I have for my Russian
>> configuration in the dictionary database:
>>
>> default_russian | lword | {en_ispell,en_stem}
>> default_russian | lpart_hword | {en_ispell,en_stem}
>> default_russian | lhword | {en_ispell,en_stem}
>> default_russian | nlword | {ru_ispell,ru_stem}
>> default_russian | nlpart_hword | {ru_ispell,ru_stem}
>> default_russian | nlhword | {ru_ispell,ru_stem}
>>
>> Notice that I index only Russian and English words: no numbers, URLs, etc.
>> You could just delete the unwanted rows in pg_ts_cfgmap for your configuration,
>> but I'd recommend updating them instead, setting dict_name to NULL.
>> For example, to stop indexing integers:
>>
>> update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian' and
>> tok_alias='int';
>>
>> voc=# select token,dict_name,tok_type,tsvector from ts_debug('Do you have
>> +70000 bucks');
>> token | dict_name | tok_type | tsvector
>> --------+---------------------+----------+----------
>> Do | {en_ispell,en_stem} | lword |
>> you | {en_ispell,en_stem} | lword |
>> have | {en_ispell,en_stem} | lword |
>> +70000 | | int |
>> bucks | {en_ispell,en_stem} | lword | 'buck'
>>
>> Only 'bucks' gets indexed :)
>> Hmm, I should probably add this to the documentation.
>>
>> What about word statistics (# of unique words, for example)?
>>
>
> I'm now following the guide to add the ispell dictionary and I've updated
> most of the rows setting dict_name to NULL:
>
> ts_name | tok_alias | dict_name
> -----------------+--------------+-----------
> default | lword | {en_stem}
> default | nlword | {simple}
> default | word | {simple}
> default | part_hword | {simple}
> default | nlpart_hword | {simple}
> default | lpart_hword | {en_stem}
> default | hword | {simple}
> default | lhword | {en_stem}
> default | nlhword | {simple}
>
> These are left, but I have no idea what a 'hword' or 'nlhword' or any other
> of these tokens are.
From my notes, http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes:
I've been asked how to find out which token types a parser supports. There is a function token_type(parser) for exactly this, so you can just use:
select * from token_type();
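A rough sketch of how this could be combined with your mapping table, so that each token alias is shown next to its description and the dictionaries currently assigned to it. This is not taken from the notes; it assumes the parser is named 'default', that token_type() returns the columns tokid, alias and descr, and that your configuration is the 'default' one shown above:

```sql
-- Sketch only: list every token type of the 'default' parser together with
-- the dictionaries mapped to it in the 'default' configuration.
-- Assumes token_type() exposes columns (tokid, alias, descr).
select t.alias, t.descr, m.dict_name
from token_type('default') as t
left join pg_ts_cfgmap as m
       on m.tok_alias = t.alias
      and m.ts_name = 'default'
order by t.tokid;
```

Token types with an empty dict_name in the result are the ones that will not be indexed.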
>
> Anyway, how do I find out the number of unique words or other word
> statistics?
From my notes, http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes:
It's useful to look at word statistics, for example to check how well your
dictionaries work or how you have configured pg_ts_cfgmap. You may also notice
likely stop words relevant to your collection.
Tsearch provides the stat() function:
.......................
Don't hesitate to read them, and if you find any bugs or know of better wording,
I'd be glad to improve my notes.
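As a minimal sketch of the kind of query this allows (not a copy of what is in the notes): stat() takes a query that returns a tsvector column and yields one row per word with the columns word, ndoc and nentry. The table name documents and the column name fts_index below are hypothetical; substitute your own.

```sql
-- Sketch only: 'documents' and 'fts_index' are placeholder names.

-- Number of unique words in the collection:
select count(*) from stat('select fts_index from documents');

-- The ten most frequent words: good candidates for stop words.
-- ndoc   = number of documents containing the word
-- nentry = total number of occurrences of the word
select word, ndoc, nentry
from stat('select fts_index from documents')
order by ndoc desc, nentry desc, word
limit 10;
```

Note that stat() has to read every tsvector the inner query returns, so on a big table it can take a while.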
>
> Rick
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83