From: | Daniel Begin <jfd553(at)hotmail(dot)com> |
---|---|
To: | "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Scott Marlowe'" <scott(dot)marlowe(at)gmail(dot)com> |
Cc: | "'Andy Colson'" <andy(at)squeakycode(dot)net>, <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) |
Date: | 2014-12-09 14:58:18 |
Message-ID: | COL129-DS22F3AC2E7068560F72605B94650@phx.gbl |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Thank Tom,
I understand that the rationale behind choosing to create a new table from
distinct records is that, since both approaches need full table scans,
selecting distinct records is faster (and seems more straight forward) than
finding/deleting duplicates;
Best regards,
Daniel
-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org
[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
Sent: December-08-14 21:52
To: Scott Marlowe
Cc: Andy Colson; Daniel Begin; pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)
Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> If you're de-duping a whole table, no need to create indexes, as it's
> gonna have to hit every row anyway. Fastest way I've found has been:
> select a,b,c into newtable from oldtable group by a,b,c;
> On pass, done.
> If you want to use less than the whole row, you can use select
> distinct on (col1, col2) * into newtable from oldtable;
Also, the DISTINCT ON method can be refined to control which of a set of
duplicate keys is retained, if you can identify additional columns that
constitute a preference order for retaining/discarding dupes. See the
"latest weather reports" example in the SELECT reference page.
In any case, it's advisable to crank up work_mem while performing this
operation.
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To make
changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
From | Date | Subject | |
---|---|---|---|
Next Message | lin | 2014-12-09 15:29:14 | pg_restore -n sch1 : schema "sch1" does not exist |
Previous Message | Albe Laurenz | 2014-12-09 10:24:47 | Re: Use cases for lateral that do not involve a set returning function |