Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

From: Jonathan Vanasco <postgres(at)2xlp(dot)com>
To: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date: 2014-12-12 21:46:13
Message-ID: A8038B1A-B4FD-4D61-A1B8-DB80BE3AB002@2xlp.com
Lists: pgsql-general


On Dec 8, 2014, at 9:35 PM, Scott Marlowe wrote:

> select a,b,c into newtable from oldtable group by a,b,c;
>
> One pass, done.

This is a bit naive, but couldn't this approach potentially be faster (depending on the system)?

SELECT a, b, c
  INTO duplicate_records
  FROM (
    SELECT a, b, c, count(*) AS counted
    FROM source_table
    GROUP BY a, b, c
  ) q_inner
  WHERE q_inner.counted > 1;

DELETE FROM source_table
  USING duplicate_records
  WHERE source_table.a = duplicate_records.a
    AND source_table.b = duplicate_records.b
    AND source_table.c = duplicate_records.c;

It would require multiple full table scans, but it would minimize writing to disk. Isn't a 'read' operation usually much cheaper than a 'write' operation? And if the duplicate check only covers a small subset of columns, indexes could speed things up too.
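
As written, the DELETE removes every copy of a duplicated (a, b, c), while duplicate_records still holds exactly one row per group. So a full dedup could finish by restoring a single copy of each group. A sketch, assuming a, b, and c are the only columns that matter and the index name is made up:

-- restore one copy of each duplicated group from the staging table
INSERT INTO source_table (a, b, c)
SELECT a, b, c
FROM duplicate_records;

-- built before the DELETE, an index on the duplicate-defining columns
-- could speed up the join in the DELETE ... USING step
CREATE INDEX source_table_abc_idx ON source_table (a, b, c);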
