Re: Using database to find file doublettes in my computer

From: Lothar Behrens <lothar(dot)behrens(at)lollisoft(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Using database to find file doublettes in my computer
Date: 2008-11-19 15:52:39
Message-ID: dc61ab60-2816-478c-9d53-ea9a2a3dcb30@b31g2000prb.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 18 Nov., 07:40, cr(dot)(dot)(dot)(at)postnewspapers(dot)com(dot)au (Craig Ringer) wrote:

> -- Once paths is populated, extract duplicates:
>
> SELECT get_filename(path) AS fn, count(path) AS n
> FROM paths HAVING count(path) > 1
> INTO TEMPORARY TABLE dup_files;
>
> -- Creates UNIQUE index on PATH as well
> ALTER TABLE dup_files ADD CONSTRAINT PRIMARY KEY (path);
>
> -- Now build your side-by-side table of duplicates:
>
> SELECT p1.path, p2.path
> FROM paths p1 INNER JOIN paths p2
>      ON (get_filename(p1.path) = get_filename(p2.path))
> WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
>   AND p1.path > p2.path;
>
> You can replace get_filename(fn) with appropriate code, but I'd write a
> quick SQL function marked IMMUTABLE to neatly wrap up the pathname
> extraction instead.
>

Hi Craig,

I have done the steps as you described. I have about 14000 files with
an md5sum.
Based on the full filename I have updated the md5sum in my base table
'ECADFiles'.

With the following query I see about 2900 files that are available
multible times:

select "Name", count("Pfad") As n
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1

Using this query I see 13000 double files as a sum:

select sum(n) from (
select "Name", count("Pfad") As n
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1) as temp

Using the following query I get ~ 129000 records:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum",
p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2
ON (p1."Name" = p2."Name")
where EXISTS (select 1 from "Datei" where "Name" = p1."Name" AND
"Anzahl" > 1)
and p1."Datei" > p2."Datei"

I have expected a smaller amount of records due to the fact that for 4
files each available 2 times (sum = 8) I
have 8 records in ECADFiles, but must have 4 in the above result.

So for an average of 2 doubles I expected half the files from
ECADFiles, because one is exactly right and the other
is on the left.

In general this results in about the same or less records than 13000
multible files.

Why did I get these 129000 records ?

I assume a rotating from files on the left to the right, thus about n
- 1 times too much records. Thus I have tested this:

select sum(n), sum(r) from (
select "Name", count("Pfad") As n, count("Pfad") * (count("Pfad") - 1)
As r
from "ECADFiles"
Group by "Name"
having count("Pfad") > 1
) as temp

But I got 259240. This is probably not correct.

Testing that with the first left file to search on right I get the n
occurences at all as of n double files. Assuming there are
n - 1 too much, I have got my expected result by changing the last AND
rule to the opposite:

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum",
p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2
ON (p1."Name" = p2."Name")
where p1."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\
\MW211.SCH' AND EXISTS (select 1 from "Datei" where "Name" = p1."Name"
AND "Anzahl" > 1)
and p1."Datei" > p2."Datei"

Gives 7 records with 7 different right files and the 8th on the left.

select p1."Datei", p1."MD5Sum", p1."ID", p2."Datei", p2."MD5Sum",
p2."ID"
from "ECADFiles" p1 INNER JOIN "ECADFiles" p2
ON (p1."Name" = p2."Name")
where p2."Datei" = E'C:\\drefsrechneralt\\Drefs511_2\\EAGLE\
\MW211.SCH' AND EXISTS (select 1 from "Datei" where "Name" = p1."Name"
AND "Anzahl" > 1)
and p1."Datei" < p2."Datei"

Gives 7 records with 7 different left files and the 8th on the right.

Any ideas how to remove these unwanted records ?
This seems not to be easy, because I see problems araising when
changing the data a cursor runs over.

Thanks

Lothar

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Sam Mason 2008-11-19 15:58:32 Re: return MAX and when it happened
Previous Message Grzegorz Jaśkiewicz 2008-11-19 15:52:28 Re: tracking down a warning