Quick Links

Re: Removing duplicate records from a bulk upload

From:	Andy Colson <andy(at)squeakycode(dot)net>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Removing duplicate records from a bulk upload
Date:	2014-12-08 16:30:02
Message-ID:	5485D20A.6020403@squeakycode.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On 12/7/2014 9:31 PM, Daniel Begin wrote:
> I have just completed the bulk upload of a large database. Some tables
> have billions of records and no constraints or indexes have been applied
> yet. About 0.1% of these records may have been duplicated during the
> upload and I need to remove them before applying constraints.
>
> I understand there are (at least) two approaches to get a table without
> duplicate records…
>
> - Delete duplicate records from the table based on an
> appropriate select clause;
>
> - Create a new table with the results from a select distinct
> clause, and then drop the original table.
>
> What would be the most efficient procedure in PostgreSQL to do the job
> considering …
>
> - I do not know which records were duplicated;
>
> - There are no indexes applied on tables yet;
>
> - There is no OIDS on tables yet;
>
> - The database is currently 1TB but I have plenty of disk space.
>
> Daniel
>

How would you detect duplicate? Is there a single field that would be
duplicated? Or do you have to test a bunch of different fields?

If its a single field, you could find dups in a single pass of the table
with:

create index bigtable_key on bigtable(key);
select key, count(*) from bigtable group by key having count(*) > 1;

Save that list, and decide on some way of deleting the dups.

The index might help the initial select, but will really help re-query
and delete statements.

-Andy

In response to

Removing duplicate records from a bulk upload at 2014-12-08 03:31:36 from Daniel Begin

Responses

Re: Removing duplicate records from a bulk upload at 2014-12-08 16:38:39 from Andy Colson

Browse pgsql-general by date

	From	Date	Subject
Next Message	Andy Colson	2014-12-08 16:38:39	Re: Removing duplicate records from a bulk upload
Previous Message	Nguyễn Trần Quốc Vinh	2014-12-08 15:59:18	Incremental update for matview - automatic trigger-in-c generator