Questions regarding contrib/tsearch

From: "Markus Wollny" <Markus(dot)Wollny(at)computec(dot)de>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Questions regarding contrib/tsearch
Date: 2002-08-02 09:30:31
Message-ID: 2266D0630E43BB4290742247C8910575014CE32A@dozer.computec.de
Lists: pgsql-general

Hello!

I have installed tsearch for several of our databases (PostgreSQL 7.2.1
on SuSE Linux 7.3, 4x PIII Xeon 550, 2 GB RAM, RAID 5 w/ 5x18 GB) and it's
really working like a charm - much more flexible than contrib/fti and
faster on inserts, too. The documentation is still a bit short on
explaining what exactly is happening - the fti mechanism was quite easy
to grasp, whereas the workings of tsearch seem more like a black box to
me. But as long as it works, and it does work very well indeed, you
won't find me complaining :)

I have got two questions, though. First: has somebody already
implemented this with a dictionary for German? I imagine that it could
be a bit difficult, because plurals of words containing diphthongs tend
to transform one vowel into an umlaut ("Haus"->"Häuser", "Maus"->"Mäuse"
etc.), and there's a lot of more complicated grammar that doesn't
confine itself to changing a suffix... So German morphology cannot be
described in an algorithm quite as easily. And the umlaut vowels and the
sharp-s are buggering me anyway, because we store them in HTML
transcription in the database; therefore a search for "Fähre"
(f&auml;hre = ferry) must be written as "f&auml&hre" in the search
string - and if there is just one word containing an o-umlaut (&ouml, ö)
in that row, a search for "Föhre" (search string: "f&ouml&hre"; the word
is German for "pine tree") or "führe" (search string: "f&uuml&hre";
German subjunctive of the verb "to drive") would lead to a hit, even
though these words don't appear in the indexed text at all. I can live
with this tiny inaccuracy, because we've got a website about games and
99% of our searches are for game titles (and nearly all of them are
English; hardly any contain an umlaut), but I'd be interested in your
experiences with this issue and how you resolved it - or whether you
just ignored it, too :)
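
What I'm considering as a workaround is to normalize the entities to
plain transliterations before the text ever reaches the indexing
function, so that the lexer sees whole words instead of fragments like
'f', 'auml' and 'hre'. A rough sketch - the column name "body" is just a
stand-in for whatever holds the message text, and I'm not sure replace()
is even available on 7.2 (a small PL/pgSQL function would do the same
job):

-- Map the umlaut entities and the sharp-s to their usual ASCII
-- transliterations before indexing; extend with &Auml; etc. as needed.
select replace(replace(replace(replace(body,
           '&auml;', 'ae'),
           '&ouml;', 'oe'),
           '&uuml;', 'ue'),
           '&szlig;', 'ss')
  from ct_com_board_message;

The same expression could go into whatever maintains the textindex
column, so a search for "Fähre" would then simply be written as
'faehre'.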

My second question is about performance - I think I know the answer, but
I'd like to know whether I'm correct. I've got a table containing forum
messages with nearly 500,000 rows; the tsearch-indexed fields can
currently contain as much as 5,000 characters per row (we plan on
allowing about 12,000 characters in the near future), and the column
that holds the txtidx data is named textindex. Now I start a search for
messages containing the words 'Dungeon' and 'Siege':

First time search for 'dungeon&siege':

community=# explain analyze select count (*) from ct_com_board_message
            where textindex ## 'dungeon&siege';
NOTICE:  QUERY PLAN:

Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=24913.37..24913.38 rows=1 loops=1)
  ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=14.62..24899.67 rows=2647 loops=1)
Total runtime: 24913.60 msec

Second time, same search:

community=# explain analyze select count (*) from ct_com_board_message
            where textindex ## 'dungeon&siege';
NOTICE:  QUERY PLAN:

Aggregate  (cost=1985.40..1985.40 rows=1 width=0) (actual time=415.66..415.66 rows=1 loops=1)
  ->  Index Scan using t_idx on ct_com_board_message  (cost=0.00..1984.18 rows=487 width=0) (actual time=0.12..407.31 rows=2647 loops=1)
Total runtime: 415.88 msec

Just out of curiosity - the number of results of this query:

community=# select count (*) from ct_com_board_message
            where textindex ## 'dungeon&siege';
 count
-------
  2647
(1 row)
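
As a quick experiment to check the caching theory below, one could force
a sequential read of the whole table before the first search, so that
the heap pages end up in the OS file cache (a sketch - this assumes the
table actually fits into the 2 GB of RAM):

-- A sequential scan touches every page of the table once,
-- pre-warming the file cache before the "real" search.
select count(*) from ct_com_board_message;

After that, even a first-time ## search should behave more like the
second run above.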

Right now the database is not yet in full production, as our main
website is not running on it yet - and that site causes more than 90% of
our overall database traffic. This is why I need to get as much
performance as I can out of searching before switching this last
remaining site to PostgreSQL, too. I suspect that the high runtime of
the first call of that query is due to the database having to read the
needed parts of the table from disk into memory. That would explain the
acceptably low runtime of the second call - the needed pages are already
in memory, so there's no slow disk access involved and the query
completes quite quickly. Is this correct? If so, what can I do to keep
all of the database in memory? The machine has 2 GB of RAM, and if I
dump all the databases into one SQL script, the resulting file is about
600 MB in size. Surely it should be possible to keep most of that in
memory at all times? What would I need to do to accomplish this? And is
there a way to get even more detailed information about query execution,
such as how much time is spent on planning, on disk access, and so on?
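
My current guess is that the answer involves the buffer settings in
postgresql.conf and the kernel's shared memory limit - a sketch of what
I would try, where the numbers are guesses for a 2 GB box rather than
tested values:

# postgresql.conf: shared_buffers is counted in 8 kB pages here,
# so 32768 pages is roughly 256 MB of shared buffer cache.
shared_buffers = 32768
# Planner hint for how much data the OS is likely to cache,
# also in 8 kB pages (guess: about 1 GB).
effective_cache_size = 131072

# The kernel must allow a shared memory segment that large, e.g. on Linux:
#   sysctl -w kernel.shmmax=536870912

The rest of the RAM is probably better left to the operating system's
file cache than grabbed by PostgreSQL itself.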

Regards,

Markus Wollny
