Removing duplicate records from a bulk upload

From: Daniel Begin <jfd553(at)hotmail(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Removing duplicate records from a bulk upload
Date: 2014-12-08 03:31:36
Message-ID: COL129-DS2247F5657A1B6114A3819194640@phx.gbl
Lists: pgsql-general

I have just completed the bulk upload of a large database. Some tables have billions of records, and no constraints or indexes have been applied yet. About 0.1% of these records may have been duplicated during the upload, and I need to remove them before applying constraints.

I understand there are (at least) two approaches to get a table without duplicate records, sketched below…

- Delete duplicate records from the table based on an appropriate select clause;

- Create a new table with the results from a select distinct clause, and then drop the original table.
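
For illustration, rough sketches of what I have in mind follow. The table and column names (big_table, col1, col2) are placeholders, not my real schema, and the column list in PARTITION BY / DISTINCT would have to cover every column that defines a duplicate:

    -- Approach 1: delete duplicates in place. With no key column yet,
    -- the ctid system column identifies physical rows, and row_number()
    -- marks every row after the first in each duplicate group.
    DELETE FROM big_table
    WHERE ctid IN (
        SELECT ctid
        FROM (SELECT ctid,
                     row_number() OVER (PARTITION BY col1, col2) AS rn
              FROM   big_table) AS t
        WHERE rn > 1
    );

    -- Approach 2: rewrite the table with SELECT DISTINCT, then swap it in.
    CREATE TABLE big_table_dedup AS
        SELECT DISTINCT * FROM big_table;
    DROP TABLE big_table;
    ALTER TABLE big_table_dedup RENAME TO big_table;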

What would be the most efficient procedure in PostgreSQL to do the job considering …

- I do not know which records were duplicated;

- No indexes have been applied to the tables yet;

- There are no OIDs on the tables yet;

- The database is currently 1 TB, but I have plenty of disk space.

Daniel
