Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

From: Daniel Begin <jfd553(at)hotmail(dot)com>
To: "'Marc Mamin'" <M(dot)Mamin(at)intershop(dot)de>, "'John McKown'" <john(dot)archie(dot)mckown(at)gmail(dot)com>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Scott Marlowe'" <scott(dot)marlowe(at)gmail(dot)com>, "'Andy Colson'" <andy(at)squeakycode(dot)net>, "'PostgreSQL General'" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date: 2014-12-12 21:22:05
Message-ID: COL129-DS3E0CE99D112E37F27231394600@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thank Marc (and all others)

I knew that nothing was for free and understanding where the costs come from
would provide me with some rationale to make my choice!

However, I understand from your answer that there is no definitive approach
to do it right at this time (considering my specific context). Since the
tables are quite similar (number of fields and data types), I will then
probably explore some of the promising approaches proposed so far (below) to
get facts.

a) select distinct * into newtable from oldtable;

b) select a,b,c into newtable from oldtable group by a,b,c;

c) select distinct a,b,c into newtable from oldtable where a%6=* group by
a,b,c;

d) select a,b,c into newtable from oldtable where a%6=* group by a,b,c;

c&d will be run after having set the visibility bit with a select count.
Running on 8 cores, only 6 will be used (a%6=*)

Something else to add/replace on scenarios above (a-d) ?

Before going further, here are some context/concerns you may wish to comment
.

- Most of the tables do not fit in my computer's 32GB memory - since they
are usually between 35GB and 350GB;

- All the tables have a bigint ID I can use with the modulo operator
(id%6=*);

Working on Windows platform.

- work_mem: 1024MB

- maintenance_work_mem: 16384 (Understand the value must be significantly
larger than work_mem)

- shared_buffers: 128MB (Understand that on Windows the useful range is 64MB
to 512MB)

Waiting comments and hoping to get back soon with useful results

Daniel

From: Marc Mamin [mailto:M(dot)Mamin(at)intershop(dot)de]
Sent: December-12-14 14:25
To: John McKown; Daniel Begin
Cc: Tom Lane; Scott Marlowe; Andy Colson; PostgreSQL General
Subject: AW: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)

>> Seems promising but could you provide me a reference to PostgreSQL
>> documentation regarding this "a%8=*" feature?
>> Best

>
> % is the modulus operator.
>
>Assuming "a" is an integer (I don't remember), then doing 8 selects of "a
modulus 8" = for each of the possible results (0..7)? will each select about
1/8 of the entire table (I would guess) and the end result put together,
they will end up selecting all of the original table. I don't know, myself,
why this would be faster. But I'm not any kind of a PostgreSQL expert
either.

yes.
Extracting unique values from very large sets is not for free,
neither from the I/O nor from the cpu point of view
spreading the tasks on more CPUs should reduce I/O.

(Does your huge table fit in RAM ?)

If you don't have int values available for the % operator, you may also
consider
(hashtext(value))/%, but the extra work may result in no benefit.
hashtext is not documented as it is not garanteed to stay stables in future
Postgres release,
but is safe in such a case
(http://lmgtfy.com/?q=postgres+hashtext+partition)

Another point is that I'm not sure that all threads will grep on a shared
scan on a freshly created table
where the visiblity hint bit is not yet set:

(see head comment in
http://doxygen.postgresql.org/syncscan_8c_source.html)

...because reading a table for the first time implies to rewrite it:

http://www.cybertec.at/speeding-up-things-with-hint-bits/

You'll be able to avoid this extra I/O in upcoming 9.4 thanks to the new
COPY FREEZE option:
http://www.postgresql.org/docs/9.4/interactive/sql-copy.html

hmmm,
on the other hand, I suppose that you will avoid the extra I/O for the hint
bit
if you first copy your data in a temp table, but then you won't be able to
parallelize the job
as other transactions won't see the data.
Moreover you need to pay attention to how much work_mem you can afford to
each transaction,
knowing you have x of them running concurrently.

So at the end I'm not sure if multiple threads will help here.
I'm using this approach in aggregations which are more cpu intensive than a
simple distinct.

I'm looking forward to see your tests results :)

Marc Mamin

>
> Daniel
>
> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org
> [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Marc Mamin
> Sent: December-12-14 06:41
> To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
> Cc: 'Andy Colson'; pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
> (rationale behind selecting a method)
>
>
> >Thank Tom,
> >I understand that the rationale behind choosing to create a new table
> >from distinct records is that, since both approaches need full table
> >scans, selecting distinct records is faster (and seems more straight
> >forward) than finding/deleting duplicates;
>
> Hi,
> on a large table you may get it faster while using more than one
thread.
> e.g.:
>
> select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c;
select
> a,b,c into newtable from oldtable where a%8 =1 group by a,b,c; ...
> select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;
>
> This will/should use a shared full table scan on oldtable.
>
> HTH
>
> Marc Mamin
>
>
> >
> >Best regards,
> >Daniel
> >
> >-----Original Message-----
> >From: pgsql-general-owner(at)postgresql(dot)org
> >[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
> >Sent: December-08-14 21:52
> >To: Scott Marlowe
> >Cc: Andy Colson; Daniel Begin; pgsql-general(at)postgresql(dot)org
> >Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
> >(rationale behind selecting a method)
> >
> >Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> writes:
> >> If you're de-duping a whole table, no need to create indexes, as
it's
> >> gonna have to hit every row anyway. Fastest way I've found has been:
> >
> >> select a,b,c into newtable from oldtable group by a,b,c;
> >
> >> On pass, done.
> >
> >> If you want to use less than the whole row, you can use select
> >> distinct on (col1, col2) * into newtable from oldtable;
> >
> >Also, the DISTINCT ON method can be refined to control which of a set
> >of duplicate keys is retained, if you can identify additional columns
> >that constitute a preference order for retaining/discarding dupes.
See
> >the "latest weather reports" example in the SELECT reference page.
> >
> >In any case, it's advisable to crank up work_mem while performing this
> >operation.
> >
> > regards, tom lane
> >
> >
> >--
> >Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To
> >make changes to your subscription:
> >http://www.postgresql.org/mailpref/pgsql-general
> >
> >
> >
> >--
> >Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To
> >make changes to your subscription:
> >http://www.postgresql.org/mailpref/pgsql-general
> >
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To
make
> changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
>
>
>--
>?
>While a transcendent vocabulary is laudable, one must be eternally careful
so that the calculated objective of communication does not become ensconced
in obscurity. In other words, eschew obfuscation.
>
>Maranatha! <><
>John McKown
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jonathan Vanasco 2014-12-12 21:40:23 function indexes, index only scan and sorting
Previous Message Alan Nilsson 2014-12-12 21:11:51 Re: pgxs question - linking c-functions to external libraries