From: PFC <lists(at)peufeu(dot)com>
To: Lutischán Ferenc <lutischanf(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Improve Seq scan performance
Date: 2008-11-17 10:51:32
Message-ID: op.ukre36tbcigqcu@soyouz
Lists: pgsql-performance
OK, I see your problem. Try this:

Read this: http://www.postgresql.org/docs/current/static/pgtrgm.html
Locate the pg_trgm.sql file and load it with \i.
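(Note: on PostgreSQL 9.1 and later, the contrib SQL scripts were replaced by extensions, so the equivalent would be:)

  -- modern (9.1+) equivalent of \i pg_trgm.sql
  CREATE EXTENSION pg_trgm;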
CREATE TABLE dict( s TEXT );
I loaded an English-German dictionary into a test table. I didn't parse
it, so it's just a bunch of 418552 strings, English and German mixed.
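For reference, a minimal sketch of such a load, assuming one entry per
line in a hypothetical server-readable file /tmp/dict.txt:

  COPY dict (s) FROM '/tmp/dict.txt';  -- path is hypothetical
  ANALYZE dict;                        -- refresh planner statistics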
test=> EXPLAIN ANALYZE SELECT * FROM dict WHERE s LIKE '%tation%';

                                                QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on dict  (cost=0.00..7445.90 rows=133 width=13) (actual time=0.102..217.155 rows=802 loops=1)
   Filter: (s ~~ '%tation%'::text)
 Total runtime: 217.451 ms
(3 rows)

Time: 217.846 ms
Since this data does not change very often, let's use a GIN index.
CREATE INDEX trgm_idx ON dict USING gin (s gin_trgm_ops);
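To see what the index actually stores, pg_trgm's show_trgm() function
returns the trigrams extracted from a string:

  SELECT show_trgm('tation');
  -- something like: {"  t"," ta",ati,ion,"on ",tat,tio}
  -- (strings are padded with spaces before trigrams are taken)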
With trigrams we can search by similarity. So, we can issue this:

EXPLAIN ANALYZE SELECT s, similarity(s, 'tation') AS sml FROM dict WHERE s % 'tation' ORDER BY sml DESC, s;

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1114.44..1115.49 rows=419 width=13) (actual time=190.778..190.980 rows=500 loops=1)
   Sort Key: (similarity(s, 'tation'::text)), s
   Sort Method: quicksort  Memory: 37kB
   ->  Bitmap Heap Scan on dict  (cost=35.80..1096.19 rows=419 width=13) (actual time=113.486..188.825 rows=500 loops=1)
         Filter: (s % 'tation'::text)
         ->  Bitmap Index Scan on trgm_idx  (cost=0.00..35.69 rows=419 width=0) (actual time=112.011..112.011 rows=15891 loops=1)
               Index Cond: (s % 'tation'::text)
 Total runtime: 191.189 ms
It is not much faster than the seq scan here, but it can give you useful
results, tolerate spelling errors, etc. It may also be faster when the
data is not cached.
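It can also be tuned: the % operator relies on a global similarity
threshold (0.3 by default), exposed through pg_trgm's show_limit() and
set_limit() functions. A quick sketch:

  SELECT show_limit();    -- current threshold, 0.3 by default
  SELECT set_limit(0.5);  -- stricter: fewer but closer matches

Raising the threshold shrinks the candidate set that the bitmap heap
scan has to recheck.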
Sample results:
taxation | 0.6
station | 0.5
tabulation | 0.5
taction | 0.5
Taxation {f} | 0.5
Taxation {f} | 0.5
If you do not want spelling-error tolerance, only exact substring matches, you can do this:
EXPLAIN ANALYZE SELECT s FROM dict WHERE s LIKE '%tation%' AND s % 'tation';

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on dict  (cost=35.70..1096.09 rows=1 width=13) (actual time=66.583..80.980 rows=306 loops=1)
   Filter: ((s ~~ '%tation%'::text) AND (s % 'tation'::text))
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..35.69 rows=419 width=0) (actual time=65.799..65.799 rows=15891 loops=1)
         Index Cond: (s % 'tation'::text)
 Total runtime: 81.140 ms
(5 rows)

Time: 81.652 ms
In this case the trigram index is used to narrow the search, and the LIKE
filter keeps only the exact substring matches.
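If you use this combination a lot, you could wrap it up; a minimal
sketch (search_dict is a hypothetical name, not part of pg_trgm):

  -- hypothetical convenience wrapper around the pattern above
  CREATE FUNCTION search_dict(q TEXT) RETURNS SETOF TEXT AS $$
      SELECT s FROM dict
       WHERE s % q                     -- trigram index narrows candidates
         AND s LIKE ('%' || q || '%')  -- exact substring filter
  $$ LANGUAGE sql STABLE;

  SELECT * FROM search_dict('tation');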
Careful though: it might not always match. For instance, if you search for
"rat" you won't find "consideration", because the search string is too
short; its few trigrams give a very low similarity to a long word.
Anyway, I would suggest changing your strategy.

You could try preloading everything into an in-memory array of strings in
your application. This would be much faster.
You could also build a list of unique words from your dictionary, which
contains lots of multi-word expressions (see the sketch below). Then, when
the user enters a query, find the words that contain the entered text, and
use a full-text index to search your dictionary for entries containing
those words.
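A minimal sketch of building such a word list, assuming entries are
split on whitespace:

  -- split every entry into words, keep each word once
  CREATE TABLE words AS
      SELECT DISTINCT lower(regexp_split_to_table(s, E'\\s+')) AS word
        FROM dict;
  -- text_pattern_ops lets LIKE 'abc%' prefix searches use the index
  CREATE INDEX words_word_idx ON words (word text_pattern_ops);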
> I tested first with only some words, and later with '%a%', '%b%', etc.
> When I re-query the table with a previously used term (e.g. 1. '%a%' -
> slow, 2. '%b%' - slow, 3. '%a%' - fast), it is faster than the old method.
When the user enters a very short string like 'a' or 'is', I don't think
it is relevant to display all entries that contain it, because that
could be most of your dictionary. Instead, why not display all unique
words which start with this string (as sketched below)? Far fewer results,
faster, and probably more useful too. Then the user can select a longer
word and search with that.
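With the hypothetical words table sketched above, that prefix lookup is
a cheap index range scan:

  SELECT word FROM words
   WHERE word LIKE 'ta%'   -- uses the text_pattern_ops index
   ORDER BY word
   LIMIT 100;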
Also, pagination is overrated. If there are 50 pages of results, the user
will never click through them anyway; they are more likely to refine their
query instead. So, just display the first 100 results and be done with it
;)