Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

From: balasubramanian c r <crbs(dot)siebel(at)gmail(dot)com>
To: Ninad Shah <nshah(dot)postgres(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds
Date: 2021-09-02 18:09:40
Message-ID: CANnzXMMDRqRxHkod8iU7bKv+o59iosVeNvzpzmCJ-78+Rv4F6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

HI Ninad

Thanks for your reply.
If bitmap index should not be used. Do i need to disable it for the time
being and carry out the test.

The documentation in pg_similarity shows that index can be created on text
column using gin with gin_similarity_ops.
The same way the index is created like
CREATE INDEX on address using GIN(complete_address gin_similarity_ops);

AFAIK I have not seen any other operators other than gin in the
pg_smilarity extension.

Thanks
C.R.Bala

On Thu, Sep 2, 2021 at 8:17 PM Ninad Shah <nshah(dot)postgres(at)gmail(dot)com> wrote:

> Hi Bala,
>
> Are your statistics updated? Also, have you used the gin
> operator(gin_similarity_ops) correctly?
>
> It is fetching just 6 records out of a million, hence, it should not go
> for bitmap index scan. As bitmap index scan loads a complete index, and
> access relevant pages from the table later by bitmap heap scan.
>
>
> Regards,
> Ninad Shah
>
>
> On Thu, 2 Sept 2021 at 16:39, balasubramanian c r <crbs(dot)siebel(at)gmail(dot)com>
> wrote:
>
>> Hi Team
>>
>> We have encountered a problem in our testing environment.
>> I have a scenario where I am running a similarity match for an address
>> I have created a table with following number of records
>> 1603423
>>
>> We are using pg_similarity extension in postgresql version is 13.
>>
>> And I have created GIN index (since i am using pg_similarity) library
>> jaccard similarity method
>> when I run the Explain analyze
>> EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj
>> nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address
>> where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad
>> 201017' order by qsim DESC;
>> QUERY PLAN
>>
>> Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual
>> time=12101.194..12101.197 rows=6 loops=1)
>>
>> Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity
>> up ghaziabad 201017'::text)) DESC
>> Sort Method: quicksort Memory: 25kB
>> -> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93)
>> (actual time=3516.233..12101.172 rows=6 loops=1)
>> Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity
>> up ghaziabad 201017'::text)
>> Rows Removed by Index Recheck: 1039186
>> Heap Blocks: exact=58018
>> -> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02
>> rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1)
>> Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up
>> ghaziabad 201017'::text)
>> Planning Time: 0.141 ms
>> Execution Time: 12101.245 ms
>> (11 rows)
>> it took 12 seconds
>> following are my pgconf file settings
>> shared buffer as 4GB
>> work_mem 256 MB
>> maintenence_work_mem 512MB
>> autovacuum_work_mem 20MB
>> My index definition is this "address_complete_address_idx1" gin
>> (complete_address gin_similarity_ops)
>>
>> It is taking the index correctly. But why it took 12 seconds to process I
>> really don't understand.
>>
>> Please help.
>>
>> Thanks
>> C.R.Bala
>>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2021-09-02 18:13:49 Re: Upgrade 9.5 cluster on Ubuntu 16.04
Previous Message Charles Paperman 2021-09-02 18:00:03 Re: jsonpath duplication result