Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

From: John McKown <john(dot)archie(dot)mckown(at)gmail(dot)com>
To: Daniel Begin <jfd553(at)hotmail(dot)com>
Cc: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date: 2014-12-12 16:59:24
Message-ID: CAAJSdjiR9sUOQNqWQKZRr9BCyqVAtA_2dYVfhWKcHUGYhgFSxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Fri, Dec 12, 2014 at 9:57 AM, Daniel Begin <jfd553(at)hotmail(dot)com> wrote:

> Seems promising but could you provide me a reference to PostgreSQL
> documentation regarding this "a%8=*" feature?
> Best
>

​% is the modulus operator.

Assuming "a" is an integer (I don't remember), then doing 8 selects of "a
modulus 8" = for each of the possible results (0..7)​ will each select
about 1/8 of the entire table (I would guess) and the end result put
together, they will end up selecting all of the original table. I don't
know, myself, why this would be faster. But I'm not any kind of a
PostgreSQL expert either.

>
> Daniel
>
> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org
> [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Marc Mamin
> Sent: December-12-14 06:41
> To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
> Cc: 'Andy Colson'; pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
> (rationale behind selecting a method)
>
>
> >Thank Tom,
> >I understand that the rationale behind choosing to create a new table
> >from distinct records is that, since both approaches need full table
> >scans, selecting distinct records is faster (and seems more straight
> >forward) than finding/deleting duplicates;
>
> Hi,
> on a large table you may get it faster while using more than one thread.
> e.g.:
>
> select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c;
> select
> a,b,c into newtable from oldtable where a%8 =1 group by a,b,c; ...
> select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;
>
> This will/should use a shared full table scan on oldtable.
>
> HTH
>
> Marc Mamin
>
>
> >
> >Best regards,
> >Daniel
> >
> >-----Original Message-----
> >From: pgsql-general-owner(at)postgresql(dot)org
> >[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
> >Sent: December-08-14 21:52
> >To: Scott Marlowe
> >Cc: Andy Colson; Daniel Begin; pgsql-general(at)postgresql(dot)org
> >Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
> >(rationale behind selecting a method)
> >
> >Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> >> If you're de-duping a whole table, no need to create indexes, as it's
> >> gonna have to hit every row anyway. Fastest way I've found has been:
> >
> >> select a,b,c into newtable from oldtable group by a,b,c;
> >
> >> On pass, done.
> >
> >> If you want to use less than the whole row, you can use select
> >> distinct on (col1, col2) * into newtable from oldtable;
> >
> >Also, the DISTINCT ON method can be refined to control which of a set
> >of duplicate keys is retained, if you can identify additional columns
> >that constitute a preference order for retaining/discarding dupes. See
> >the "latest weather reports" example in the SELECT reference page.
> >
> >In any case, it's advisable to crank up work_mem while performing this
> >operation.
> >
> > regards, tom lane
> >
> >
> >--
> >Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To
> >make changes to your subscription:
> >http://www.postgresql.org/mailpref/pgsql-general
> >
> >
> >
> >--
> >Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To
> >make changes to your subscription:
> >http://www.postgresql.org/mailpref/pgsql-general
> >
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To make
> changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

--

While a transcendent vocabulary is laudable, one must be eternally careful
so that the calculated objective of communication does not become ensconced
in obscurity. In other words, eschew obfuscation.

Maranatha! <><
John McKown

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jeff Janes 2014-12-12 17:18:01 Re: anyone using oid2name?
Previous Message Daniel Begin 2014-12-12 15:57:21 Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)