Re: Slow search.. quite clueless

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Philip Hallstrom <postgresql(at)philip(dot)pjkh(dot)com>
Cc: Yonatan Ben-Nes <da(at)canaan(dot)co(dot)il>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow search.. quite clueless
Date: 2005-09-20 18:34:48
Message-ID: Pine.GSO.4.63.0509202233190.20320@ra.sai.msu.su
Lists: pgsql-general

On Tue, 20 Sep 2005, Philip Hallstrom wrote:

>> contrib/tsearch2 ( http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ )
>> might work for you. It might, because performance depends on the
>> cardinality of your keywords.
>
> Seconded. We use tsearch2 to search about 40,000 rows containing
> manufacturer, brand, and product name and it returns a result almost
> instantly. Before, when we did normal SQL "manufacturer LIKE ..., etc." it
> would take 20-30 seconds.
>
> One thing to check is the english.stop file which contains words to skip (i,
> a, the, etc.). In our case we removed almost all of them since one of our
> products is "7 up" (the drink) and it would remove "up". Made it really hard
> to pull up 7 up in the results :)

We have "rewriting query support (thesaurus search)" on our todo list
(http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo).
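As a rough illustration of the tsearch2 approach suggested above, applied to a product table like the one in the thread below, a minimal sketch might look as follows (the 'default' configuration and the fts column name are assumptions, and contrib/tsearch2 must already be installed in the database):

```sql
-- Sketch only: assumes contrib/tsearch2 is installed and that product
-- names are the text being searched.
ALTER TABLE product ADD COLUMN fts tsvector;
UPDATE product SET fts = to_tsvector('default', product_name);
CREATE INDEX product_fts_idx ON product USING gist (fts);

-- All search terms must match, so AND them together in the tsquery:
SELECT product_id, product_name
FROM product
WHERE fts @@ to_tsquery('default', 'belt & black & pants')
LIMIT 13;
```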

>
> -philip
>
>>
>> Oleg
>> On Tue, 20 Sep 2005, Yonatan Ben-Nes wrote:
>>
>>> Hi all,
>>>
>>> I'm building a site where users can search for products with up to 4
>>> different keywords, all of which MUST match each product returned by the
>>> search.
>>>
>>> I have 2 tables (the ones relevant to the issue :)), one is the product
>>> table (5 million rows) and the other is the keyword table, which holds
>>> the keywords of each product (60 million rows).
>>>
>>> The scheme of the tables is as follows:
>>>
>>> Table "public.product"
>>> Column | Type | Modifiers
>>> ----------------------------+---------------+---------------------
>>> product_id | text | not null
>>> product_name | text | not null
>>> retail_price | numeric(10,2) | not null
>>> etc...
>>> Indexes:
>>> "product_product_id_key" UNIQUE, btree (product_id)
>>>
>>> Table "public.keyword"
>>> Column | Type | Modifiers
>>> -------------+---------------+-----------
>>> product_id | text | not null
>>> keyword | text | not null
>>> Indexes:
>>> "keyword_keyword" btree (keyword)
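Since the queries below filter the keyword table by keyword and then join on product_id, a composite index covering both columns is sometimes worth trying (a sketch with a hypothetical index name; PostgreSQL of that era had no index-only scans, so the gain is not guaranteed):

```sql
-- Hypothetical index: filters on keyword while also carrying the join
-- column, so the planner can match and join from the same index.
CREATE INDEX keyword_keyword_product_id ON keyword (keyword, product_id);
```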
>>>
>>> The best query I have managed so far joins the keyword table once per
>>> keyword searched. For example, if someone searches for "belt" &
>>> "black" & "pants" it will create the following query:
>>>
>>> poweraise.com=# EXPLAIN ANALYZE SELECT
>>> product_id,product_name,product_image_url,short_product_description,long_product_description,discount,discount_type,sale_price,retail_price
>>> FROM product INNER JOIN keyword t1 USING(product_id) INNER JOIN keyword t2
>>> USING(product_id) INNER JOIN keyword t3 USING(product_id) WHERE
>>> t1.keyword='belt' AND t2.keyword='black' AND t3.keyword='pants' LIMIT 13;
>>>
>>> QUERY PLAN
>>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Limit  (cost=37734.15..39957.20 rows=13 width=578) (actual time=969.798..1520.354 rows=6 loops=1)
>>>    ->  Hash Join  (cost=37734.15..3754162.82 rows=21733 width=578) (actual time=969.794..1520.337 rows=6 loops=1)
>>>          Hash Cond: ("outer".product_id = "inner".product_id)
>>>          ->  Nested Loop  (cost=18867.07..2858707.34 rows=55309 width=612) (actual time=82.266..1474.018 rows=156 loops=1)
>>>                ->  Hash Join  (cost=18867.07..2581181.09 rows=55309 width=34) (actual time=82.170..1462.104 rows=156 loops=1)
>>>                      Hash Cond: ("outer".product_id = "inner".product_id)
>>>                      ->  Index Scan using keyword_keyword on keyword t2  (cost=0.00..331244.43 rows=140771 width=17) (actual time=0.033..1307.167 rows=109007 loops=1)
>>>                            Index Cond: (keyword = 'black'::text)
>>>                      ->  Hash  (cost=18851.23..18851.23 rows=6337 width=17) (actual time=16.145..16.145 rows=0 loops=1)
>>>                            ->  Index Scan using keyword_keyword on keyword t1  (cost=0.00..18851.23 rows=6337 width=17) (actual time=0.067..11.050 rows=3294 loops=1)
>>>                                  Index Cond: (keyword = 'belt'::text)
>>>                ->  Index Scan using product_product_id_key on product  (cost=0.00..5.01 rows=1 width=578) (actual time=0.058..0.060 rows=1 loops=156)
>>>                      Index Cond: (product.product_id = "outer".product_id)
>>>          ->  Hash  (cost=18851.23..18851.23 rows=6337 width=17) (actual time=42.863..42.863 rows=0 loops=1)
>>>                ->  Index Scan using keyword_keyword on keyword t3  (cost=0.00..18851.23 rows=6337 width=17) (actual time=0.073..36.120 rows=3932 loops=1)
>>>                      Index Cond: (keyword = 'pants'::text)
>>>  Total runtime: 1521.441 ms
>>> (17 rows)
>>>
>>> Sometimes the query runs fast even for 3 keywords, but that doesn't help
>>> me if at other times it takes ages....
>>>
>>> Now, finding results for 1 keyword is really fast, so I also tried making
>>> 3 queries and doing an INTERSECT between them, but that turned out to be
>>> extremely slow...
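One common rewrite for this kind of "all keywords must match" query avoids both the triple self-join and INTERSECT by counting matches per product (a sketch, not tested against this schema; the HAVING constant must equal the number of search terms):

```sql
-- Products that matched ALL of the searched keywords.
SELECT p.product_id, p.product_name
FROM product p
JOIN (
    SELECT product_id
    FROM keyword
    WHERE keyword IN ('belt', 'black', 'pants')
    GROUP BY product_id
    HAVING count(DISTINCT keyword) = 3  -- number of search terms
) k USING (product_id)
LIMIT 13;
```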
>>>
>>> What makes this query slow, as far as I understand, is all the merging
>>> between the results from each table... I tried splitting the keyword
>>> table into many tables, each holding only the keywords starting with a
>>> specific letter; it improved the speed, but not significantly. I also
>>> tried clusters, indexes, SET STATISTICS, WITHOUT OIDS on the keyword
>>> table and what not.. I'm quite clueless...
>>>
>>> Actually, I even started looking at other solutions; maybe you can say
>>> something about them as well.. maybe they can help me:
>>> 1. Omega (From the Xapian project) - http://www.xapian.org/
>>> 2. mnoGoSearch - http://www.mnogosearch.org/doc.html
>>> 3. Swish-e - http://swish-e.org/index.html
>>>
>>> On top of everything, I eventually want to be able to ORDER BY the
>>> results, e.g. order the products by price, but I'm less concerned about
>>> that since I saw that with a cluster I can do it without any extra
>>> overhead.
>>>
>>> Thanks a lot in advance,
>>> Yonatan Ben-Nes
>>>
>>>
>>>
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>> Sternberg Astronomical Institute, Moscow University (Russia)
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>> phone: +007(095)939-16-83, +007(095)939-23-83
>>
>>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
