Re: ts_rank seems very slow (140 ranked documents / second on my machine)

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-13 15:36:11
Message-ID: Pine.LNX.4.64.1107131932410.17363@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I didn't notice, reading 40K tuples in random order takes a long time and this
is a problem of any database. Can you measure time to read all documents found ?
:( The only solution I see is to store enough
information for ranking in index.

Oleg
On Wed, 13 Jul 2011, Nicolas Grilly wrote:

> Hi Oleg and all,
>
> On Wed, Jul 13, 2011 at 08:16, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>> there is problem with estimating of cost scanning gin index in < 9.1
>> versions,
>> so you can set enable_seqscan=off;
>> or try 9.1 which beta3 now.
>
> I re-ran my queries using enable seqscan=off.
>
> Now the first query, without ts_rank, uses the GIN index:
>
> set enable_seqscan=off;
> explain analyze select id
> from posts_1000000
> where to_tsquery('english', 'crare') @@ document_vector
> limit 50;
>
> Limit (cost=42290.12..42306.31 rows=50 width=4) (actual
> time=16.259..16.412 rows=50 loops=1)
> -> Bitmap Heap Scan on posts_1000000 (cost=42290.12..57877.02
> rows=48152 width=4) (actual time=16.256..16.344 rows=50 loops=1)
> Recheck Cond: ('''crare'''::tsquery @@ document_vector)
> -> Bitmap Index Scan on index_posts_documents_1000000
> (cost=0.00..42278.08 rows=48152 width=0) (actual time=13.265..13.265
> rows=49951 loops=1)
> Index Cond: ('''crare'''::tsquery @@ document_vector)
> Total runtime: 16.484 ms
>
> But the second query, the one that uses ts_rank, is still very slow...
> Any idea why? Is ts_rank efficient enough to find the best 50 matches
> among 50 000 documents?
>
> set enable_seqscan=off;
> explain analyze select id
> from posts_1000000
> where to_tsquery('english', 'crare') @@ document_vector
> order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
> limit 50;
>
> Limit (cost=59596.98..59597.10 rows=50 width=22) (actual
> time=296212.052..296212.257 rows=50 loops=1)
> -> Sort (cost=59596.98..59717.36 rows=48152 width=22) (actual
> time=296186.928..296187.007 rows=50 loops=1)"
> Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
> Sort Method: top-N heapsort Memory: 27kB
> -> Bitmap Heap Scan on posts_1000000
> (cost=42290.12..57997.40 rows=48152 width=22) (actual
> time=70.861..296059.515 rows=49951 loops=1)
> Recheck Cond: ('''crare'''::tsquery @@ document_vector)
> -> Bitmap Index Scan on index_posts_documents_1000000
> (cost=0.00..42278.08 rows=48152 width=0) (actual time=24.922..24.922
> rows=49951 loops=1)
> Index Cond: ('''crare'''::tsquery @@ document_vector)
> Total runtime: 296220.493 ms
>
>>> By the way, does ts_rank is supposed to use a GIN index when it's
>>> available?
>>
>> no, I see no benefit :)
>
> Ok. But what is the solution to improve ts_rank execution time? Am I
> doing something wrong?
>
> Thanks for your help,
>
> Nicolas
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Joseph Marlin 2011-07-13 15:55:45 Re: Web-based Graphical Query Building Tool for PostgreSQL
Previous Message Lars Kanis 2011-07-13 13:59:02 Using LDAP roles in PostgreSQL