Quick Links

Re: Slow search.. quite clueless

From:	Yonatan Ben-Nes <da(at)canaan(dot)co(dot)il>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Philip Hallstrom <postgresql(at)philip(dot)pjkh(dot)com>, pgsql-general(at)postgresql(dot)org, Alex Turner <armtuk(at)gmail(dot)com>
Subject:	Re: Slow search.. quite clueless
Date:	2005-09-21 20:24:20
Message-ID:	4331C174.3090207@canaan.co.il
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Oleg Bartunov wrote:
> On Tue, 20 Sep 2005, Philip Hallstrom wrote:
>
>>> contrib/tsearch2 (
>>> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ )
>>> might works for you. It might because performance depends on
>>> cardinality of your keywords.
>>
>>
>> Seconded. We use tsearch2 to earch about 40,000 rows containing
>> manufacturer, brand, and product name and it returns a result almost
>> instantly. Before when we did normal SQL "manufacture LIKE ..., etc."
>> it would take 20-30 seconds.
>>
>> One thing to check is the english.stop file which contains words to
>> skip (i, a, the, etc.). In our case we removed almost all of them
>> since one of our products is "7 up" (the drink) and it would remove
>> "up". Made it really hard to pull up 7 up in the results :)
>
>
> we have "rewriting query support ( thesauri search)" in our todo
> (http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo)
>
>
>>
>> -philip
>>
>>>
>>> Oleg
>>> On Tue, 20 Sep 2005, Yonatan Ben-Nes wrote:
>>>
>>>> Hi all,
>>>>
>>>> Im building a site where the users can search for products with up
>>>> to 4 diffrent keywords which all MUST match to each product which
>>>> found as a result to the search.
>>>>
>>>> I got 2 tables (which are relevant to the issue :)), one is the
>>>> product table (5 million rows) and the other is the keyword table
>>>> which hold the keywords of each product (60 million rows).
>>>>
>>>> The scheme of the tables is as follows:
>>>>
>>>> Table "public.product"
>>>> Column | Type | Modifiers
>>>> ----------------------------+---------------+---------------------
>>>> product_id | text | not null
>>>> product_name | text | not null
>>>> retail_price | numeric(10,2) | not null
>>>> etc...
>>>> Indexes:
>>>> "product_product_id_key" UNIQUE, btree (product_id)
>>>>
>>>> Table "public.keyword"
>>>> Column | Type | Modifiers
>>>> -------------+---------------+-----------
>>>> product_id | text | not null
>>>> keyword | text | not null
>>>> Indexes:
>>>> "keyword_keyword" btree (keyword)
>>>>
>>>> The best query which I succeded to do till now is adding the keyword
>>>> table for each keyword searched for example if someone search for
>>>> "belt" & "black" & "pants" it will create the following query:
>>>>
>>>> poweraise.com=# EXPLAIN ANALYZE SELECT
>>>> product_id,product_name,product_image_url,short_product_description,long_product_description,discount,discount_type,sale_price,retail_price
>>>> FROM product INNER JOIN keyword t1 USING(product_id) INNER JOIN
>>>> keyword t2 USING(product_id) INNER JOIN keyword t3 USING(product_id)
>>>> WHERE t1.keyword='belt' AND t2.keyword='black' AND
>>>> t3.keyword='pants' LIMIT 13;
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> Limit (cost=37734.15..39957.20 rows=13 width=578) (actual
>>>> time=969.798..1520.354 rows=6 loops=1)
>>>> -> Hash Join (cost=37734.15..3754162.82 rows=21733 width=578)
>>>> (actual time=969.794..1520.337 rows=6 loops=1)
>>>> Hash Cond: ("outer".product_id = "inner".product_id)
>>>> -> Nested Loop (cost=18867.07..2858707.34 rows=55309
>>>> width=612) (actual time=82.266..1474.018 rows=156 loops=1)
>>>> -> Hash Join (cost=18867.07..2581181.09 rows=55309
>>>> width=34) (actual time=82.170..1462.104 rows=156 loops=1)
>>>> Hash Cond: ("outer".product_id =
>>>> "inner".product_id)
>>>> -> Index Scan using keyword_keyword on keyword
>>>> t2 (cost=0.00..331244.43 rows=140771 width=17) (actual
>>>> time=0.033..1307.167 rows=109007 loops=1)
>>>> Index Cond: (keyword = 'black'::text)
>>>> -> Hash (cost=18851.23..18851.23 rows=6337
>>>> width=17) (actual time=16.145..16.145 rows=0 loops=1)
>>>> -> Index Scan using keyword_keyword on
>>>> keyword t1 (cost=0.00..18851.23 rows=6337 width=17) (actual
>>>> time=0.067..11.050 rows=3294 loops=1)
>>>> Index Cond: (keyword = 'belt'::text)
>>>> -> Index Scan using product_product_id_key on product
>>>> (cost=0.00..5.01 rows=1 width=578) (actual time=0.058..0.060 rows=1
>>>> loops=156)
>>>> Index Cond: (product.product_id =
>>>> "outer".product_id)
>>>> -> Hash (cost=18851.23..18851.23 rows=6337 width=17)
>>>> (actual time=42.863..42.863 rows=0 loops=1)
>>>> -> Index Scan using keyword_keyword on keyword t3
>>>> (cost=0.00..18851.23 rows=6337 width=17) (actual time=0.073..36.120
>>>> rows=3932 loops=1)
>>>> Index Cond: (keyword = 'pants'::text)
>>>> Total runtime: 1521.441 ms
>>>> (17 rows)
>>>>
>>>> Sometimes the query work fast even for 3 keywords but that doesnt
>>>> help me if at other times it take ages....
>>>>
>>>> Now to find a result for 1 keyword its really flying so I also tried
>>>> to make 3 queries and do INTERSECT between them but it was found out
>>>> to be extremly slow...
>>>>
>>>> Whats make this query slow as far as I understand is all the merging
>>>> between the results of each table... I tried to divide the keyword
>>>> table into lots of keywords table which each hold keywords which
>>>> start only with a specific letter, it did improve the speeds but not
>>>> in a real significant way.. tried clusters,indexes,SET
>>>> STATISTICS,WITHOUT OIDS on the keyword table and what not.. im quite
>>>> clueless...
>>>>
>>>> Actually I even started to look on other solutions and maybe you can
>>>> say something about them also.. maybe they can help me:
>>>> 1. Omega (From the Xapian project) - http://www.xapian.org/
>>>> 2. mnoGoSearch - http://www.mnogosearch.org/doc.html
>>>> 3. Swish-e - http://swish-e.org/index.html
>>>>
>>>> To add on everything I want at the end to be able to ORDER BY the
>>>> results like order the product by price, but im less concerned about
>>>> that cause I saw that with cluster I can do it without any extra
>>>> overhead.
>>>>
>>>> Thanks alot in advance,
>>>> Yonatan Ben-Nes
>>>>
>>>>
>>>> ---------------------------(end of
>>>> broadcast)---------------------------
>>>> TIP 1: if posting/reading through Usenet, please send an appropriate
>>>> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
>>>> message can get through to the mailing list cleanly
>>>>
>>>
>>> Regards,
>>> Oleg
>>> _____________________________________________________________
>>> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>>> Sternberg Astronomical Institute, Moscow University (Russia)
>>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>>> phone: +007(095)939-16-83, +007(095)939-23-83
>>>
>>> ---------------------------(end of broadcast)---------------------------
>>> TIP 9: In versions below 8.0, the planner will ignore your desire to
>>> choose an index scan if your joining column's datatypes do not
>>> match
>>>
>>
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83

Hi all,

First of all sorry for the delay we had a problem with out mail server...

The tsearch2 looks really promising, im starting to work with it now and
ill report what ill find.

And to Alex thanks but I tried already all of the things you recommended
and sadly it didnt help.

Thanks alot for the help everyone!
Yonatan Ben-Nes

In response to

Re: Slow search.. quite clueless at 2005-09-20 18:34:48 from Oleg Bartunov

Browse pgsql-general by date

	From	Date	Subject
Next Message	Douglas McNaught	2005-09-21 20:38:48	Re: COPY and Unicode...
Previous Message	CSN	2005-09-21 20:12:20	Re: index row size exceeds btree maximum