From: | Tarlika Elisabeth Schmitz <postgresql(at)numerixtechnology(dot)de> |
---|---|
To: | pgsql-sql(at)postgresql(dot)org |
Subject: | identifying duplicates in table with redundancies |
Date: | 2010-09-23 21:39:40 |
Message-ID: | 20100923223940.5019475e@dick.coachhouse |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
I loaded data from a spread into a interim table so I can analyze the
quality of the data.
The table contains an entry for every student (250K records) and his
trainer. Eventually, I want to extract a unique list of trainers from
it. But first of all I want to check for duplicates:
1) multiples trainer names for same trainer id
2) multiple trainer ids for same trainer name
I cobbled together the SQL and it does the job but it seems rather
convoluted. I would like to know how I can improve it.
CREATE TABLE student (
id INTEGER NOT NULL,
name VARCHAR(256) NOT NULL,
trainer_id INTEGER,
trainer_name VARCHAR(256),
);
====
EXAMPLE DATA
22 John 1 Macdonald
23 Jane 1 MacDonald
24 Paul 1 MacDonald
25 Dick 2 Smith
26 Bill 3 Smith
27 Kate 3 Smith
====
-- outputs trainer ids which appear under different names
select trainer_id, trainer_name from
(
-- different id/name combinations
select distinct on (trainer_name) trainer_id, trainer_name
from student
where trainer_id in
(
-- trainer ids with appearing with different names
select distinct on (id) id
from
(
-- distinct trainer id-name
select distinct on (trainer_id,trainer_name)
trainer_id as id,
trainer_name as name from student
) as trainer
group by trainer.id
having count (trainer.name) > 1
)
) as y
order by trainer_id
--
Best Regards,
Tarlika Elisabeth Schmitz
From | Date | Subject | |
---|---|---|---|
Next Message | Nicholas I | 2010-09-24 09:15:54 | find and replace the string within a column |
Previous Message | Rob Sargent | 2010-09-23 18:46:55 | Re: pg_config -less |