Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

From: balasubramanian c r <crbs(dot)siebel(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Lewis <mlewis(at)entrata(dot)com>, Ninad Shah <nshah(dot)postgres(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds
Date: 2021-09-02 22:12:26
Message-ID: CANnzXMMM+d_v=oWs2k1Sqr5cV943X-R+PcD0Rc_Uac_43ruO3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

HI Tom/Ninad
My bad I didn't explain my use case properly.
The use case is to find the best string similarity for a given address
against the list of addresses in the table.
Initially I tried a similarity function provided by the pg_trgm extension.
But the similarity scores were not satisfactory.
Later I explored the pg_similarity extension which had an exhaustive list
of functions which supported indexes primarily on GIN.
After trying multiple functions like levenshtein, cosine, qgram, jaro, jaro
winkler and jaccard similarity functions
jaccard similarity functions were providing better accurate results.

Hence we decided to use that for our analysis.

Now with a small amount of data the query response time is better. But with
such 1.6 million it is taking a long time.
Even the documentation owner of pg_similarity provided an example on how to
create an index using GIN. We followed exactly the same process.
But still when the data is huge we don't know why it is taking time to scan
through the records.

Thanks
C.R.Bala

On Fri, Sep 3, 2021 at 1:14 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Michael Lewis <mlewis(at)entrata(dot)com> writes:
> > This is showing many false positives from the index scan that get removed
> > when the actual values are examined. With such a long search parameter,
> > that does not seem surprising. I would expect a search on "raj nagar
> > ghaziabad 201017" or something like that to yield far fewer results from
> > the index scan. I don't know GIN indexes super well, but I would guess
> that
> > including words that are very common will yield false positives that get
> > filtered out later.
>
> Yeah, the huge "Rows Removed" number shows that this index is very
> poorly adapted to the query. I don't think the problem is with GIN
> per se, but with a poor choice of how to use it. The given example
> looks like what the OP really wants to do is full text search.
> If so, a GIN index should be fine as long as you put tsvector/tsquery
> filtering in front of it. If that's not a good characterization of
> the goal, it'd help to tell us what the goal is. (Just saying "I
> want to use jaccard similarity" sounds a lot like a man whose only
> tool is a hammer, therefore his problem must be a nail, despite
> evidence to the contrary.)
>
> regards, tom lane
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Shubham Mittal 2021-09-02 22:15:32 Query takes around 15 to 20 min over 20Lakh rows
Previous Message Nikolay Samokhvalov 2021-09-02 21:21:26 Re: Upgrade 9.5 cluster on Ubuntu 16.04