Re: how to make duplicate finding query faster?

From: "Gavan Schneider" <list(dot)pg(dot)gavan(at)pendari(dot)org>
To: "Sachin Kumar" <sachinkumaras(at)gmail(dot)com>
Cc: "Holger Jakobs" <holger(at)jakobs(dot)com>, krishna(at)thewebconz(dot)com, pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: how to make duplicate finding query faster?
Date: 2020-12-30 11:21:17
Message-ID: 43C74F51-3099-4787-8C73-CE928CEAB01A@pendari.org
Lists: pgsql-admin

On 30 Dec 2020, at 19:59, Sachin Kumar wrote:

> This will not suffice our requirement. we have to validate that there
> would not be any duplicate value in DB. we have done that earlier by
> leaving DB to check if there is any duplicate and found duplicate
> value in
> DB.
>
If the constraint was in place *before* any data was inserted into the
database then there will be no duplicates. Adding a constraint to a
database will not remove existing data which violates the constraint.

> we have a table "card" and a single column "number " which we are
> updated
> with a csv file with 600k numbers and require no duplicate number
> should be
> there in the table.
>
One trap is that the data may not be the same even if it’s a
“duplicate”. The “number” may contain spaces or other
human-readable characters, so it is a duplicate from the human
perspective but not the machine’s. These “duplicates” will also be
very hard to find with a simple query. If the OP is not already doing
this, I suggest the “number” be stored in the database as BIGINT, e.g.

cc_number BIGINT NOT NULL UNIQUE, -- this will prevent any addition of a duplicate cc_number
PRIMARY KEY (cc_number) -- or, more simply, in one line: cc_number BIGINT PRIMARY KEY

Obviously the csv will have to be checked to ensure the number is a
legal integer value.
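Since the stack is Django+Python, that pre-check could be a small helper
along these lines (a minimal sketch; the function names and the
space/dash handling are my assumptions, not from the original post):

```python
import csv

def clean_card_number(raw):
    """Strip spaces and dashes, then check the result is all digits.

    Returns the number as an int (safe for a BIGINT column), or None
    if the value is not a legal integer.
    """
    digits = raw.replace(" ", "").replace("-", "").strip()
    return int(digits) if digits.isdigit() else None

def load_numbers(path):
    """Read a one-column CSV; return (valid_numbers, rejected_rows)."""
    valid, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            number = clean_card_number(row[0])
            if number is None:
                rejected.append(row[0])
            else:
                valid.append(number)
    return valid, rejected
```

Rejected rows can then be reported back to whoever supplied the file,
rather than silently dropped.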

There is nothing in the original post to say what is contained in the
card field, nor whether that field must also be unique. If it holds
text, it will be more interesting to work with, since a human may see
duplication the machine does not; but if only the number has to be
unique then it is straightforward.

> Please if can have a faster query can help us in achieving this
> requirement.
>
As noted by Holger Jakobs upthread, it is not possible for a query to
be faster than checks done at the time the data is inserted. The data
has to be inserted and indexes updated regardless; if the duplicate key
constraint is violated, the duplicate is found. Otherwise the data has
to be inserted and a query run afterwards.

Assuming there is a reasonably recent PostgreSQL hidden behind the
Django and Python, i.e., 9.5 or later, then the INSERT can be attempted with

INSERT INTO … ON CONFLICT DO NOTHING —
https://www.postgresql.org/about/news/postgresql-131-125-1110-1015-9620-and-9524-released-2111/

Once this is in place the data will be added to the table as fast as is
possible, and with no duplicates. It will be very much slower if every
insertion has to be checked in the Django+Python stack. Basically, if
speed is needed, go to the database and talk SQL.
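The whole flow could look something like this (table and file names are
illustrative only; staging the raw values as TEXT lets the load filter
out non-integer rows, per the CSV check mentioned above):

```sql
-- Target table: the PRIMARY KEY constraint does the duplicate checking.
CREATE TABLE card (
    cc_number BIGINT PRIMARY KEY
);

-- Stage the raw CSV values as text so a malformed row cannot abort the load.
CREATE TEMP TABLE card_staging (raw_number TEXT);

-- Bulk-load server-side (or use \copy from psql for a client-side file).
COPY card_staging (raw_number) FROM '/path/to/numbers.csv' WITH (FORMAT csv);

-- Keep only legal integers, cast, and skip any cc_number already present.
INSERT INTO card (cc_number)
SELECT raw_number::BIGINT
FROM card_staging
WHERE raw_number ~ '^[0-9]+$'
ON CONFLICT (cc_number) DO NOTHING;
```

COPY plus a single set-based INSERT will handle 600k rows far faster
than row-by-row checks from the application.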

Regards
Gavan Schneider (who can offer no help with Django/Python)
