From: | Kevin Grittner <kgrittn(at)ymail(dot)com> |
---|---|
To: | Janek Sendrowski <janek12(at)web(dot)de>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Fastest Index/Algorithm to find similar sentences |
Date: | 2013-08-02 15:25:12 |
Message-ID: | 1375457112.43393.YahooMailNeo@web162905.mail.bf1.yahoo.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Janek Sendrowski <janek12(at)web(dot)de> wrote:
> I also tried pg_trgm module, which works with tri-grams, but it's
> also very slow with 100.000+ rows.
Hmm. I found the pg_trgm module very fast for name searches with
millions of rows *as long as I used KNN-GiST techniques*. Were you
careful to do so? Check out the "Index Support" section of this
page:
http://www.postgresql.org/docs/current/static/pgtrgm.html
While I have not tested this technique with a column containing
sentences, I would expect it to work well. As a quick
confirmation, I imported the text form of War and Peace into a
table, with one row per *line* (because that was easier than
parsing sentence boundaries for a quick test). That was over
65,000 rows.
test=# select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3;
lineno | linetext
--------+-----------------------------------------------------------------------
9082 | The gentlemen assembled at Bilibin's were young, wealthy, gay society
36575 | "Gentlemen, you are crushing me!..."
55997 | * "Good day, gentlemen."
(3 rows)
test=# explain analyze select * from war_and_peace order by linetext <-> 'young wealthy gay gentlemen' limit 3;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.80 rows=3 width=53) (actual time=37.986..38.002 rows=3 loops=1)
-> Index Scan using wap_text on war_and_peace (cost=0.28..11216.42 rows=65007 width=53) (actual time=37.984..37.999 rows=3 loops=1)
Order By: (linetext <-> 'young wealthy gay gentlemen'::text)
Total runtime: 38.180 ms
(4 rows)
To me, 38 milliseconds to search War and Peace for best matches to
a text string seems reasonable; I'm not sure what you're looking
for, since you didn't give any numbers.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Melvin Call | 2013-08-02 16:33:38 | Understanding database schemas |
Previous Message | Lionel Elie Mamane | 2013-08-02 15:18:11 | Identify primary key in simple/updatable view |