Re: How to remove duplicate lines but save one of the lines?

From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "A B" <gentosaker(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: How to remove duplicate lines but save one of the lines?
Date: 2008-07-21 16:29:10
Message-ID: dcc563d10807210929j6a68eeb2id57e7ca3b3e74cbe@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, Jul 21, 2008 at 9:51 AM, A B <gentosaker(at)gmail(dot)com> wrote:
>> There is probably a more elegant way of doing it, but a simple way of doing
>> it ( depending on the size of the table ) could be:
>>
>> begin;
>>
>> insert into foo select distinct * from orig_table;
>> delete from orig_table;
>> insert into orig_table select * from foo;
>>
>> commit;
>
> Just to make it clear to me
> Here foo is a table that I have to create with the command
> CREATE TABLE foo (....same columns as orig_table);
> ?

If this is a live table with that you can't use that method on, you
can use this generic methodology to get rid of dups.

-- Create test table
smarlowe=# create table main (i int, t text);
CREATE TABLE
smarlowe=# insert into main values (1,'A');
INSERT 0 1
smarlowe=# insert into main values (1,'A');
INSERT 0 1
smarlowe=# insert into main values (3,'B');
INSERT 0 1
smarlowe=# insert into main values (3,'B');
INSERT 0 1
smarlowe=# insert into main values (44,'C');
INSERT 0 1
smarlowe=# insert into main values (44,'C');
INSERT 0 1
smarlowe=# select * from main;
i | t
----+---
1 | A
1 | A
3 | B
3 | B
44 | C
44 | C
(6 rows)

Add a new field for an int, set it to a sequence of numbers:

smarlowe=# alter table main add uniq int;
ALTER TABLE
smarlowe=# create sequence t
smarlowe-# ;
CREATE SEQUENCE
smarlowe=# update main set uniq=nextval('t');
UPDATE 6
smarlowe=# select * from main;
i | t | uniq
----+---+------
1 | A | 1
1 | A | 2
3 | B | 3
3 | B | 4
44 | C | 5
44 | C | 6
(6 rows)

This query will give us a list of "extra" ids:

smarlowe=# select distinct m1.uniq from main m1 join main m2 on
(m1.t=m2.t and m1.i=m2.i) where m1.uniq > m2.uniq;
uniq
------
2
4
6
(3 rows)

We use that query as a subselect to a delete:

smarlowe=# begin;
BEGIN
smarlowe=# delete from main where uniq in (select m1.uniq from main m1
join main m2 on (m1.t=m2.t and m1.i=m2.i) where m1.uniq > m2.uniq);
DELETE 3
smarlowe=# select * from main;
i | t | uniq
----+---+------
1 | A | 1
3 | B | 3
44 | C | 5
(3 rows)
smarlowe=# commit;
COMMIT

> Is it possible to add a unique constraint to the table, with a
> "delete" option so it will delete duplicates?

It is possible to add a unique constraint. Having it "delete" rows
automagically is not normal operation, but I'm sure some kind of user
defined trigger could be written to do that. But if you've got a
unique constraint on a unique set of data, new non-unique entries will
fail to enter.

smarlowe=# create unique index main_t_i on main (t,i);
CREATE INDEX
smarlowe=# insert into main (i,t) values (1,'A');
ERROR: duplicate key violates unique constraint "main_t_i"

In response to

Browse pgsql-general by date

  From Date Subject
Next Message brian 2008-07-21 16:45:53 Re: Calling Python functions with parameters
Previous Message Hoover, Jeffrey 2008-07-21 16:14:48 Re: How to remove duplicate lines but save one of the lines?