Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

From: Daniel Begin <jfd553(at)hotmail(dot)com>
To: "'Andy Colson'" <andy(at)squeakycode(dot)net>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date: 2014-12-08 21:59:56
Message-ID: COL129-DS838ED023AE6FCC4446A2494640@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks for your answers Andy; I will keep in mind the procedure you proposed.
About the fields required to find duplicate records, all of them are required (5-9) depending on the table.

Considering that the tables are not indexed yet, am I right to think that both approaches will need a full table scan?
- Deleting duplicate records would need a full table scan to create temporary indexes to select/remove duplicate records;
- Inserting distinct records into an empty table will also need a full table scan to select distinct (*) from big_table;

Once said, is the indexing and selection/deletion of duplicate records faster than rewriting a whole table from distinct records? I am trying to find a rationale behind the choice - execution time, simplicity? I just don't know what the criteria should be and how/why it would affect the selection of an approach:-|

Daniel

-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Andy Colson
Sent: December-08-14 11:39
To: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload

On 12/8/2014 10:30 AM, Andy Colson wrote:
> On 12/7/2014 9:31 PM, Daniel Begin wrote:
>> I have just completed the bulk upload of a large database. Some
>> tables have billions of records and no constraints or indexes have
>> been applied yet. About 0.1% of these records may have been
>> duplicated during the upload and I need to remove them before applying constraints.
>>
>> I understand there are (at least) two approaches to get a table
>> without duplicate records…
>>
>> - Delete duplicate records from the table based on an
>> appropriate select clause;
>>
>> - Create a new table with the results from a select distinct
>> clause, and then drop the original table.
>>
>> What would be the most efficient procedure in PostgreSQL to do the
>> job considering …
>>
>> - I do not know which records were duplicated;
>>
>> - There are no indexes applied on tables yet;
>>
>> - There is no OIDS on tables yet;
>>
>> - The database is currently 1TB but I have plenty of disk
>> space.
>>
>> Daniel
>>
>
> How would you detect duplicate? Is there a single field that would be
> duplicated? Or do you have to test a bunch of different fields?
>
> If its a single field, you could find dups in a single pass of the
> table
> with:
>
> create index bigtable_key on bigtable(key); select key, count(*) from
> bigtable group by key having count(*) > 1;
>
> Save that list, and decide on some way of deleting the dups.
>
> The index might help the initial select, but will really help re-query
> and delete statements.
>
> -Andy
>
>

I just thought of a more generic way.

1) make a non-unique index on bigtable
2) make a temp table
3) -- copy only dups
insert into temp table
select * from big table where (its a duplicate);

4)
delete from bigtable where keys in (select key from temp);

5)
insert into bigtable
select distinct from temp;

This would minimize the amount of data you have to move around. Depends on how hard step 3 is to write. Index not required but would help both step 3 and 4 be faster.

-Andy

--
Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Responses

Browse pgsql-general by date

  From Date Subject
Next Message John R Pierce 2014-12-08 23:22:02 Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Previous Message Adrian Klaver 2014-12-08 21:57:29 Re: Fwd: Fwd: Problem with pg_dump and decimal mark