Re: Fastest Index/Algorithm to find similar sentences

From: Beena Emerson <memissemerson(at)gmail(dot)com>
To: Janek Sendrowski <janek12(at)web(dot)de>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fastest Index/Algorithm to find similar sentences
Date: 2013-07-31 13:53:35
Message-ID: CAOG9ApG08sdgjEd8WvfNZdPR5UoUqgdn4Sb5Bc7aJjEeTQd-ag@mail.gmail.com
Lists: pgsql-general

On Sat, Jul 27, 2013 at 10:34 PM, Janek Sendrowski <janek12(at)web(dot)de> wrote:

> Hi Sergey Konoplev,
>
> Suppose I'm searching for a sentence like "The tiger is the largest cat
> species".
>
> I can only find the sentences that include all of the words "tiger,
> largest, cat, species", but I would also like to find the sentences with
> only three or even two of these words.
>
> Janek

Hi,

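You can use the similarity functions of the pg_trgm extension. They score
two strings by the trigrams they share, so sentences that match only some
of the words are still found.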

Example:
=# \d+ test
                        Table "public.test"
 Column | Type | Modifiers | Storage  | Stats target | Description
--------+------+-----------+----------+--------------+-------------
 col    | text |           | extended |              |
Indexes:
    "test_idx" gin (col gin_trgm_ops)
Has OIDs: no
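
The setup commands were not posted, so here is a minimal sketch of how a
table like this could be created. It is reconstructed from the \d+ output
and the three sample rows in this example, and it assumes the pg_trgm
extension is available on the server.

```sql
-- Assumed setup, reconstructed from the \d+ output; not the original commands.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE test (col text);

INSERT INTO test (col) VALUES
    ('The tiger is the largest cat species'),
    ('The cheetah is the fastest cat species'),
    ('The peacock is the largest bird species');

-- The gin_trgm_ops operator class lets the % operator use this index.
CREATE INDEX test_idx ON test USING gin (col gin_trgm_ops);
```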

=# SELECT * FROM test;
                   col
-----------------------------------------
 The tiger is the largest cat species
 The cheetah is the fastest cat species
 The peacock is the largest bird species
(3 rows)

=# SELECT show_limit();
 show_limit
------------
        0.3
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
FROM test WHERE col % 'The tiger is the largest cat species'
ORDER BY sml DESC, col;
                   col                   |   sml
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
 The cheetah is the fastest cat species  | 0.466667
(3 rows)
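
To see where these scores come from, it may help to look at the trigrams
themselves. pg_trgm lowercases each word, pads it with blanks, and collects
the distinct three-character sequences; similarity() is then the number of
trigrams the two strings share divided by the number of distinct trigrams
across both. By my count the tiger and peacock sentences share 23 of 45
distinct trigrams, which agrees with the 0.511111 above. A small sketch
using show_trgm(), which ships with pg_trgm:

```sql
-- List the trigrams pg_trgm extracts from each sentence.
SELECT show_trgm('The tiger is the largest cat species');
SELECT show_trgm('The peacock is the largest bird species');

-- similarity() = shared trigrams / distinct trigrams across both strings.
SELECT similarity('The tiger is the largest cat species',
                  'The peacock is the largest bird species');
```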

=# SELECT set_limit(0.5);
 set_limit
-----------
       0.5
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
FROM test WHERE col % 'The tiger is the largest cat species'
ORDER BY sml DESC, col;
                   col                   |   sml
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
(2 rows)

=# SELECT set_limit(0.9);
 set_limit
-----------
       0.9
(1 row)

=# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
FROM test WHERE col % 'The tiger is the largest cat species'
ORDER BY sml DESC, col;
                 col                  | sml
--------------------------------------+-----
 The tiger is the largest cat species |   1
(1 row)

The % operator only returns rows whose similarity clears the current limit,
so a higher limit returns fewer rows and keeps only the closer matches.
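
Going the other way, a lower limit should also bring in sentences that
share only two or three of the words. A rough sketch; the 0.2 threshold and
the LIMIT of 10 are arbitrary values chosen for illustration, and you would
need to tune them against your own data:

```sql
-- 0.2 is an illustrative threshold, not a recommended value.
-- set_limit() affects the current session only.
SELECT set_limit(0.2);

-- Same query as in the example, capped at the ten best matches.
SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
FROM test
WHERE col % 'The tiger is the largest cat species'
ORDER BY sml DESC, col
LIMIT 10;
```

A lower limit makes the % filter less selective, so on a large table it is
worth checking the query plan and timing before settling on a value.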

--
Beena Emerson
