Re: Questions regarding contrib/tsearch

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Markus Wollny <Markus(dot)Wollny(at)computec(dot)de>
Cc: pgsql-general(at)postgresql(dot)org, openfts discussion <openfts-general(at)lists(dot)sourceforge(dot)net>
Subject: Re: Questions regarding contrib/tsearch
Date: 2002-08-02 11:55:02
Message-ID: Pine.GSO.4.44.0208021429530.25663-100000@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, 2 Aug 2002, Markus Wollny wrote:

> Hello!
>
> I have installed tsearch for several of our databases (PostgreSQL 7.2.1
> in SuSE Linux 7.3, 4xPIIIXeon550, 2GB RAM, RAID5 /w 5x18GB) and it's
> really working like a charm - much more flexible than contrib/fti and
> faster on inserts, too. Documentation still lacks a bit on explanation
> about what exactly is happening - the fti-mechanism was quite easy to
> grasp, the workings of tsearch seem more like a black box to me. But as
> long as it does work, and it does work very well indeed, you won't find
> me complaining :)

Good news
>
> I have got two questions, though. First: Has somebody implemented this
> with a dictionary for German already? I imagine that it could by a bit
> difficult because of plural of words with diphtongs in it tend to
> transform one vowel into an umlaut ("Haus"->"HДuser", "Maus"->"MДuse"
> etc.) and there's lots of more complicated grammar which doesn't confine
> itself to changing a suffix... So german morphology cannot be quite as
> easily described in an algorithm. And the umlaut-vowels and the sharp-s
> are buggering me anyway, because we store them in HTML-transcription in
> the database; therefore a search for "FДhre" (f&auml;hre = ferry) must
> be written as "f&auml&hre" in the searchstring - and if there is just
> one word containing an O-umlaut (&ouml, Ж) in that row, a search for
> "FЖhre" (searchstring: "f&ouml&hre", the word is german for "pine tree")
> or "fЭhre" (searchstring: "f&uuml&hre", german subjunctive of the verb
> "to drive") would lead to a hit, even though these words don't appear in
> the indexed text at all. I can live with this tiny inaccuracy because
> we've got a website about games and 99% of our searches are for
> game-titles (and near enough all of them are in English, hardly any of
> them contain an umlaut), but I'd be interested in your experiences with
> this issue and how you resolved it - or if you just ignored it, too :)

Marcus, OpenFTS uses contrib/tsearch as a based data type and is
(currently) much flexible in respect of language support. Particularly,
it has support of Snowball stemmers (http://snowball.sourceforge.net/)
So, in principle it should work with German. We have plan to continue
our work on tsearch, but currently we're quite busy. You may try
to write snowball interface to tsearch yourself.

>
> My second question is about performance - I think I know the answer, but
> I'd like to know if I'm correct. I've got a table containing
> forum-messages with nearly 500,000 rows; the tsearch-indexed fields can
> currently contain as much as 5,000 characters per row (we plan on
> allowing about 12,000 characters in the near future), the field that
> contains the txtidx-data is named textindex. Now I start a search for
> messages containing the words 'Dungeon' and 'Siege':
>
> First time search for 'dungeon&siege':
> community=# explain analyze select count (*) from ct_com_board_message
> where textindex ## 'dungeon&siege';
> NOTICE: QUERY PLAN:
> Aggregate (cost=1985.40..1985.40 rows=1 width=0) (actual
> time=24913.37..24913.38 rows=1 loops=1)
> -> Index Scan using t_idx on ct_com_board_message (cost=0.00..1984.18
> rows=487 width=0) (actual time=14.62..24899.67 rows=2647 loops=1)
> Total runtime: 24913.60 msec
>
> Second time, same search:
> community=# explain analyze select count (*) from ct_com_board_message
> where textindex ## 'dungeon&siege';
> NOTICE: QUERY PLAN:
> Aggregate (cost=1985.40..1985.40 rows=1 width=0) (actual
> time=415.66..415.66 rows=1 loops=1)
> -> Index Scan using t_idx on ct_com_board_message (cost=0.00..1984.18
> rows=487 width=0) (actual time=0.12..407.31 rows=2647 loops=1)
> Total runtime: 415.88 msec
>
> Just for curiosity - number of results of this query:
> community=# select count (*) from ct_com_board_message where textindex
> ## 'dungeon&siege';
> count
> -------
> 2647
> (1 row)
>
> Right now the database is not in full production state, as our main
> website is not yet running on it - and this site is causing more than
> 90% of our overall database-traffic. This is the reason I need to get as
> much performance as I can out of the searching, before switching this
> last remaining site to PostgreSQL, too. I suspect that the high running
> time for the first call of that query is due to the database having to
> do harddisk-access in order to get the needed parts of the table into
> memory. This would explain the acceptably low running time of the second
> call - the information needed is already in memory, so there's no slow
> harddisk-access involved and the query is completed quite quickly. Is
> this correct? If so, what can I do to have all of the database in
> memory? The machine has got 2 GB of RAM and if I dump all the databases
> into one sql-script, the resulting file is about 600MB in size. Shurely
> it should be possible to keep most of that in memory at all times? What
> would I need to do to accomplish this? And is there a way to get even
> more detailed information about query execution like how much time is
> needed for the query-plan, for hd-access and so on?
>

You're right ! The more data, the bigger index. There were several threads
in -hackers list about optimizing postgresql (shared memory, buffers ...).
Your database certainly should fits 2Gb

But there are several tsearch specific recommendations:

1. Use morphology, stemming and stop words. This could greatly reduce
size of index ! All postgresql documentation is consist of about
8 Kb distinct words after stemming. I don't remember if I used
stop words.

2. Use OpenFTS for now. It's much more flexible and also provides
ranking of search results. Also, it's possible to specify
different dictionaries for different languages, specify what type
of lexemes to index (currently it recognizes 23 classes) and this
is also a big win, because tsearch (currently) indexes everything !
But users usually need just words + some kind of numbers.

We're about to release 0.33 version of OpenFTS which is based on tsearch
module. I've written "The Crash-course to OpenFTS" , you may read it
currently on our page (http://www.sai.msu.su/~megera/postgres/gist/)

> Regards,
>
> Markus Wollny
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tourtounis Sotiris 2002-08-02 12:02:38 How i can empty the buffers of a db
Previous Message Yaroslav Dmitriev 2002-08-02 11:48:39 Re: []performance issues