From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Andy Colson <andy(at)squeakycode(dot)net>, Daniel Begin <jfd553(at)hotmail(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date: 2014-12-09 02:52:24
Message-ID: 14733.1418093544@sss.pgh.pa.us
Lists: pgsql-general

Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> If you're de-duping a whole table, no need to create indexes, as it's
> gonna have to hit every row anyway. Fastest way I've found has been:

> select a,b,c into newtable from oldtable group by a,b,c;

> One pass, done.

> If you want to use less than the whole row, you can use select
> distinct on (col1, col2) * into newtable from oldtable;

Also, the DISTINCT ON method can be refined to control which of a set of
duplicate keys is retained, if you can identify additional columns that
constitute a preference order for retaining/discarding dupes. See the
"latest weather reports" example in the SELECT reference page.

In any case, it's advisable to crank up work_mem while performing this
operation.
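
For instance (session-local, so it only affects the current
connection; the 1GB figure is purely illustrative, size it to the
RAM you can spare):

    SET work_mem = '1GB';

then run the de-duplicating query in the same session.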

regards, tom lane
