Quick Links

Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

From:	Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
To:	Daniel Begin <jfd553(at)hotmail(dot)com>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Scott Marlowe'" <scott(dot)marlowe(at)gmail(dot)com>
Cc:	"'Andy Colson'" <andy(at)squeakycode(dot)net>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date:	2014-12-12 11:40:54
Message-ID:	B6F6FD62F2624C4C9916AC0175D56D8828B25070@jenmbs01.ad.intershop.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

>Thank Tom,
>I understand that the rationale behind choosing to create a new table from
>distinct records is that, since both approaches need full table scans,
>selecting distinct records is faster (and seems more straight forward) than
>finding/deleting duplicates;

Hi,
on a large table you may get it faster while using more than one thread. e.g.:

select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c;
select a,b,c into newtable from oldtable where a%8 =1 group by a,b,c;
...
select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;

This will/should use a shared full table scan on oldtable.

HTH

Marc Mamin

>
>Best regards,
>Daniel
>
>-----Original Message-----
>From: pgsql-general-owner(at)postgresql(dot)org
>[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
>Sent: December-08-14 21:52
>To: Scott Marlowe
>Cc: Andy Colson; Daniel Begin; pgsql-general(at)postgresql(dot)org
>Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
>(rationale behind selecting a method)
>
>Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
>> If you're de-duping a whole table, no need to create indexes, as it's
>> gonna have to hit every row anyway. Fastest way I've found has been:
>
>> select a,b,c into newtable from oldtable group by a,b,c;
>
>> On pass, done.
>
>> If you want to use less than the whole row, you can use select
>> distinct on (col1, col2) * into newtable from oldtable;
>
>Also, the DISTINCT ON method can be refined to control which of a set of
>duplicate keys is retained, if you can identify additional columns that
>constitute a preference order for retaining/discarding dupes. See the
>"latest weather reports" example in the SELECT reference page.
>
>In any case, it's advisable to crank up work_mem while performing this
>operation.
>
> regards, tom lane
>
>
>--
>Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To make
>changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-general
>
>
>
>--
>Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-general
>

In response to

Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) at 2014-12-09 14:58:18 from Daniel Begin

Browse pgsql-general by date

	From	Date	Subject
Next Message	Peter Eisentraut	2014-12-12 14:18:53	anyone using oid2name?
Previous Message	Eric Svenson	2014-12-12 10:07:17	Re: Fwd: Fwd: Problem with pg_dump and decimal mark