From: | Nicolas Paris <niparisco(at)gmail(dot)com> |
---|---|
To: | Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> |
Cc: | Oleg Bartunov <obartunov(at)gmail(dot)com>, Postgres General <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Full Text Search combined with Fuzzy |
Date: | 2017-03-03 13:17:26 |
Message-ID: | 20170303131726.GC781@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Le 03 mars 2017 à 14:08, Artur Zakirov écrivait :
> On 03.03.2017 15:49, Nicolas Paris wrote:
> >
> >Hi Oleg,
> >
> >Thanks. I thought pgtrgm was not able to index my long texts because of
> >limitation of 8191 bytes per index row for btree.
> >
> >Then I found out it is possible to use pgtrgm over a GIN/GIST index.
> >My final use case is phrase mining in texts.
> >
> >I want my application returns texts that contains approximatly the user
> >entry:
> >
> >Eg: user search "Hello Word"
> >a text containing "blah blah blah hello world blah blah blah" would be
> >returned.
> >
> >Test:
> >postgres=# CREATE table test_trgm (texts text);
> >CREATE TABLE
> >postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
> >CREATE INDEX
> >postgres=# SET enable_seqscan = OFF;
> >SET
> >postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah blah');
> >INSERT 0 1
> >postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah blah');
> >INSERT 0 1
> >postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
> > texts | similarity
> >-------------------------------------------+------------
> > blah blah blah hello world blah blah blah | 0.473684
> > blah blah blah hello word blah blah blah | 0.6875
> >(2 rows)
> >
> >postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
> > QUERY PLAN
> >-----------------------------------------------------------------------------------
> > Bitmap Heap Scan on test_trgm (cost=52.01..56.03 rows=1 width=32)
> > Recheck Cond: (texts % 'hello word'::text)
> > -> Bitmap Index Scan on test_trgm_texts_idx (cost=0.00..52.01 rows=1 width=0)
> > Index Cond: (texts % 'hello word'::text)
> >(4 rows)
> >
> >Conclusion: If I d'say 0.4 is my threshold, would this methodology meet
> >my requirements ?
> >
> >Thanks for the help !
> >
>
> Hello,
>
> If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
> example:
>
> postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm
> WHERE 'hello word' <% texts;
> texts | word_similarity
> -------------------------------------------+-----------------
> blah blah blah hello world blah blah blah | 0.818182
> blah blah blah hello word blah blah blah | 1
> (2 rows)
>
> 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
>
Nice ! I do have 9.6 version.
Would this kind of index could handle more than 20M large texts ? The
recheck condition looks ressource consuming.
The full text index + phrase search + synonym dictionnary is the only
other alternativ to deal with typo-phrase mining ?
Is there any possibility in the future to add typo in the full text
road-map ?
Thanks,
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
From | Date | Subject | |
---|---|---|---|
Next Message | David Rowley | 2017-03-03 13:32:48 | Re: DISTINCT vs GROUP BY - was Re: is (not) distinct from |
Previous Message | Artur Zakirov | 2017-03-03 13:08:19 | Re: Full Text Search combined with Fuzzy |