From: | Paul Jungwirth <pj(at)illuminatedcomputing(dot)com> |
---|---|
To: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: Need efficient way to remove (almost) duplicate rows from a table |
Date: | 2023-10-17 23:09:17 |
Message-ID: | c6e4f48d-3af4-4d65-4031-b1e447e09214@illuminatedcomputing.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 10/17/23 15:48, David Gauthier wrote:
> I have a table with millions of records and there are a lot of "almost"
> duplicates that I want to get rid of in an efficient way. Best to
> illustrate with a simple example...
This sounds like a classic use-case for a "correlated subquery".
Assuming you want to keep the lowest twig_id, you could do this:
```
paul=# select * from twigs;
limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
1 | 1 | 1 | 100
1 | 1 | 2 | 103.7
1 | 1 | 3 | 103.7
1 | 1 | 4 | 110.4
1 | 2 | 1 | 120
1 | 2 | 2 | 123.6
1 | 2 | 3 | 128.1
1 | 2 | 4 | 128.1
1 | 2 | 5 | 128.1
2 | 1 | 1 | 100
2 | 1 | 3 | 104.4
2 | 1 | 4 | 131.9
(12 rows)
Time: 0.218 ms
paul=# delete from twigs as t1
where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id,
t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id >
t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
1 | 1 | 1 | 100
1 | 1 | 2 | 103.7
1 | 1 | 4 | 110.4
1 | 2 | 1 | 120
1 | 2 | 2 | 123.6
1 | 2 | 3 | 128.1
2 | 1 | 1 | 100
2 | 1 | 3 | 104.4
2 | 1 | 4 | 131.9
(9 rows)
Time: 0.231 ms
```
That says "delete from t1 where there is a matching t2 with a lower
twig_id."
If you want to keep the highest-numbered twig_id, the sql is easy to adjust.
Regards,
--
Paul ~{:-)
pj(at)illuminatedcomputing(dot)com
From | Date | Subject | |
---|---|---|---|
Next Message | Merlin Moncure | 2023-10-18 01:20:04 | Re: Inheritance in PostgreSQL |
Previous Message | Ron | 2023-10-17 23:06:41 | Re: Need efficient way to remove (almost) duplicate rows from a table |