Re: Need efficient way to remove (almost) duplicate rows from a table

From: Paul Jungwirth <pj(at)illuminatedcomputing(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Need efficient way to remove (almost) duplicate rows from a table
Date: 2023-10-17 23:09:17
Message-ID: c6e4f48d-3af4-4d65-4031-b1e447e09214@illuminatedcomputing.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/17/23 15:48, David Gauthier wrote:
> I have a table with millions of records and there are a lot of "almost"
> duplicates that I want to get rid of in an efficient way.  Best to
> illustrate with a simple example...

This sounds like a classic use-case for a "correlated subquery".
Assuming you want to keep the lowest twig_id, you could do this:

```
paul=# select * from twigs;
limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
1 | 1 | 1 | 100
1 | 1 | 2 | 103.7
1 | 1 | 3 | 103.7
1 | 1 | 4 | 110.4
1 | 2 | 1 | 120
1 | 2 | 2 | 123.6
1 | 2 | 3 | 128.1
1 | 2 | 4 | 128.1
1 | 2 | 5 | 128.1
2 | 1 | 1 | 100
2 | 1 | 3 | 104.4
2 | 1 | 4 | 131.9
(12 rows)

Time: 0.218 ms
paul=# delete from twigs as t1
where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id,
t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id >
t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
1 | 1 | 1 | 100
1 | 1 | 2 | 103.7
1 | 1 | 4 | 110.4
1 | 2 | 1 | 120
1 | 2 | 2 | 123.6
1 | 2 | 3 | 128.1
2 | 1 | 1 | 100
2 | 1 | 3 | 104.4
2 | 1 | 4 | 131.9
(9 rows)

Time: 0.231 ms
```

That says "delete from t1 where there is a matching t2 with a lower
twig_id."

If you want to keep the highest-numbered twig_id, the sql is easy to adjust.

Regards,

--
Paul ~{:-)
pj(at)illuminatedcomputing(dot)com

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2023-10-18 01:20:04 Re: Inheritance in PostgreSQL
Previous Message Ron 2023-10-17 23:06:41 Re: Need efficient way to remove (almost) duplicate rows from a table