From: Andy Colson <andy(at)squeakycode(dot)net>
To: Daniel Begin <jfd553(at)hotmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date: 2014-12-09 01:22:59
Message-ID: 54864EF3.2020603@squeakycode.net
Lists: pgsql-general
On 12/08/2014 03:59 PM, Daniel Begin wrote:
> Thanks for your answers, Andy; I will keep the procedure you proposed in mind.
> About the fields required to find duplicate records: all of them are required (5-9, depending on the table).
>
> Considering that the tables are not indexed yet, am I right to think that both approaches will need a full table scan?
> - Deleting duplicate records would need a full table scan to create temporary indexes to select/remove duplicate records;
> - Inserting distinct records into an empty table will also need a full table scan to select distinct (*) from big_table;
>
> That said, is indexing and then selecting/deleting duplicate records faster than rewriting a whole table from distinct records? I am trying to find a rationale behind the choice - execution time, simplicity? I just don't know what the criteria should be, or how/why they would affect the selection of an approach :-|
>
> Daniel
>
> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Andy Colson
> Sent: December-08-14 11:39
> To: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
>
> On 12/8/2014 10:30 AM, Andy Colson wrote:
>> On 12/7/2014 9:31 PM, Daniel Begin wrote:
>>> I have just completed the bulk upload of a large database. Some
>>> tables have billions of records and no constraints or indexes have
>>> been applied yet. About 0.1% of these records may have been
>>> duplicated during the upload and I need to remove them before applying constraints.
>>>
>>> I understand there are (at least) two approaches to get a table
>>> without duplicate records…
>>>
>>> - Delete duplicate records from the table based on an
>>> appropriate select clause;
>>>
>>> - Create a new table with the results from a select distinct
>>> clause, and then drop the original table.
>>>
>>> What would be the most efficient procedure in PostgreSQL to do the
>>> job considering …
>>>
>>> - I do not know which records were duplicated;
>>>
>>> - There are no indexes applied on tables yet;
>>>
>>>          - There are no OIDs on tables yet;
>>>
>>> - The database is currently 1TB but I have plenty of disk
>>> space.
>>>
>>> Daniel
>>>
>>
>> How would you detect a duplicate? Is there a single field that would be
>> duplicated? Or do you have to test a bunch of different fields?
>>
>> If it's a single field, you could find dups in a single pass of the
>> table with:
>>
>> create index bigtable_key on bigtable (key);
>>
>> select key, count(*)
>> from bigtable
>> group by key
>> having count(*) > 1;
>>
>> Save that list, and decide on some way of deleting the dups.
>>
>> The index might help the initial select, but it will really help the
>> re-query and delete statements.
>>
>> -Andy
>>
>>
>
> I just thought of a more generic way.
>
> 1) make a non-unique index on bigtable
> 2) make a temp table
> 3) -- copy only dups
>      insert into temp
>      select * from bigtable where (it's a duplicate);
>
> 4)
>    delete from bigtable where key in (select key from temp);
>
> 5)
> insert into bigtable
>    select distinct * from temp;
>
>
> This would minimize the amount of data you have to move around. How practical it is depends on how hard step 3 is to write. The index is not required, but it would make both steps 3 and 4 faster.
>
> -Andy
>
>
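
For what it's worth, here is a minimal sketch of those five steps. It assumes
duplicates can be identified by a single hypothetical column named key; all
names are illustrative, not tested against your schema:

-- 1) non-unique index to speed up the duplicate test and the delete
create index bigtable_key on bigtable (key);

-- 2) + 3) copy only the duplicated rows into a temp table
create temp table dups as
select * from bigtable
where key in (select key from bigtable
              group by key having count(*) > 1);

-- 4) remove every duplicated row from the big table
delete from bigtable where key in (select key from dups);

-- 5) put one copy of each back
insert into bigtable
select distinct * from dups;
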
I was assuming you could find dups using a single field. That field could be indexed, and the index would help both find the dups and delete the rows.
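
If that holds, an even simpler in-place variant is possible - a rough sketch,
again assuming a hypothetical key column, using the system column ctid to keep
one physical copy per key value:

delete from bigtable a
using bigtable b
where a.key = b.key
  and a.ctid > b.ctid;
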
If you have to use all the columns ... well, I'd ignore everything I said. :-)
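
In that case the rewrite Daniel described is probably the simplest route; a
hedged sketch, with illustrative table names:

-- distinct over every column (treats NULLs as equal), into a fresh table
create table bigtable_dedup as
select distinct * from bigtable;

drop table bigtable;
alter table bigtable_dedup rename to bigtable;
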
-Andy