From: | Nicolas Grilly <nicolas(at)gardentechno(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | ts_rank seems very slow (140 ranked documents / second on my machine) |
Date: | 2011-07-12 18:19:21 |
Message-ID: | CAG3yVS7Gdzzgr5+5z9Hq0X-YTLto5+Frcg_hrqDyg2Vm01Ob3w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello,
I'm testing PostgreSQL full-text search on a table containing
1.000.000 documents. Document average length is 5.700 chars.
Performance is good and very similar to what I can get with Xapian if
I don't use ts_rank. But response time collapses if I use ts_rank to
select the 50 best matching documents.
This is the table and index definition:
create table posts_1000000 (
id serial primary key,
document_vector tsvector
);
create index index_posts_documents_1000000 ON posts_1000000 USING
gin(document_vector);
This is the query without ts_rank (the word 'crare' matches 5 % of documents):
select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
limit 50
Limit (cost=0.00..27.93 rows=50 width=4) (actual time=0.303..12.559
rows=50 loops=1)
Output: id
-> Seq Scan on posts_1000000 (cost=0.00..27472.51 rows=49184
width=4) (actual time=0.299..12.451 rows=50 loops=1)
Output: id
Filter: ('''crare'''::tsquery @@ document_vector)
Total runtime: 12.642 ms
Now, this is the query using ts_rank:
select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
limit 50
Limit (cost=29229.33..29229.45 rows=50 width=22) (actual
time=355516.233..355516.339 rows=50 loops=1)
Output: id
-> Sort (cost=29229.33..29352.29 rows=49184 width=22) (actual
time=355516.230..355516.268 rows=50 loops=1)
Output: id
Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on posts_1000000 (cost=0.00..27595.47 rows=49184
width=22) (actual time=0.251..355389.367 rows=49951 loops=1)
Output: id
Filter: ('''crare'''::tsquery @@ document_vector)
Total runtime: 355535.063 ms
The ranking is very slow: 140 ranked documents / second on my machine!
I'm afraid this is because ts_rank needs to read document_vector, and
because that column is stored in TOAST table, it triggers a random
access for each matching row. Am I correct? Is it the expected
behavior? Is there a way to reduce the execution time?
I use PostgreSQL 8.4 with shared_buffers = 256 MB, work_mem = 256 MB.
Thanks for your help and advice.
--
Nicolas Grilly
Garden
+33 1 45 72 48 78 - office
+33 6 03 00 25 34 - mobile
www.gardentechno.com - Développement web & reporting / Web development
& data analytics
www.vocationcity.com - Plateforme de recrutement sur le web / Web
recruitment platform
From | Date | Subject | |
---|---|---|---|
Next Message | Oleg Bartunov | 2011-07-12 20:25:08 | Re: ts_rank seems very slow (140 ranked documents / second on my machine) |
Previous Message | hyelluas | 2011-07-12 18:16:37 | Re: Concurrent read from a partition table. |