| From: | Nicolas Paris <niparisco(at)gmail(dot)com> | 
|---|---|
| To: | Oleg Bartunov <obartunov(at)gmail(dot)com> | 
| Cc: | Postgres General <pgsql-general(at)postgresql(dot)org> | 
| Subject: | Re: Full Text Search combined with Fuzzy | 
| Date: | 2017-03-03 12:49:05 | 
| Message-ID: | 20170303124904.GB781@gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-general | 
Le 27 févr. 2017 à 10:32, Oleg Bartunov écrivait :
> 
> 
> On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris <niparisco(at)gmail(dot)com> wrote:
> 
>     Hello,
> 
>     AFAIK there is no built-in way to combine full text search and fuzzy
>     matching
>     (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html)
>     By example, phrase searching with tipos in it.
> 
>     First I don't know if postgresql concurrents (lucene based...) are able
>     to do so.
> 
> 
> Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used for
> this.
> 
Hi Oleg,
Thanks. I thought pgtrgm was not able to index my long texts because of
limitation of 8191 bytes per index row for btree.
Then I found out it is possible to use pgtrgm over a GIN/GIST index.
My final use case is phrase mining in texts.
I want my application returns texts that contains approximatly the user
entry:
Eg: user search "Hello Word"
a text containing "blah blah blah hello world blah blah blah" would be
returned.
Test:
postgres=# CREATE table test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                   | similarity 
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)
postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)
Conclusion: If I d'say 0.4 is my threshold, would this methodology meet
my requirements ?
Thanks for the help !
> 
>     Second, is such feature is in the road map ?
> 
>     Third, I wonder if it is a good idea to use the postgresql synonyms
>     feature for such prupose.(https://www.postgresql.org/docs/current/static/
>     textsearch-dictionaries.html)
>     I mean, building up a synonyms dictionnary containing tipos. By eg:
> 
>     postgres        pgsql
>     postgresql      pgsql
>     postgrez        pgsql
>     postgre         pgsql
>     gogle           googl
>     gooogle         googl
> 
>     There is multiple way to build such dictionary. But my question is about
>     the implementation of dictionnaries in postgresql: Is postgresql
>     supposed to take advantage of billion entries dictionaries ?
> 
> 
> dictionary is just a program, so it's  up to developer how to write efficient
> program to deal with billion entries. Specifically to synonym dictionary, it's
> not intended to work with a lot of entries. btw, have a look on contrib/
> dict_xsyn dictionary, which is more flexible than synonym.
> 
> 
>     Thanks by advance for you answers,
>    
> 
>     --
>     Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>     To make changes to your subscription:
>     http://www.postgresql.org/mailpref/pgsql-general
> 
> 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Geoff Winkless | 2017-03-03 12:58:02 | Re: ERROR: functions in index expression must be marked IMMUTABLE | 
| Previous Message | Sven R. Kunze | 2017-03-03 12:17:54 | Re: ERROR: functions in index expression must be marked IMMUTABLE |