Re: Using database to find file doublettes in my computer

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Lothar Behrens <lothar(dot)behrens(at)lollisoft(dot)de>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Using database to find file doublettes in my computer
Date: 2008-11-20 11:01:46
Message-ID: 4925439A.2010601@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Lothar Behrens wrote:

> I have expected a smaller amount of records due to the fact that for 4
> files each available 2 times (sum = 8) I
> have 8 records in ECADFiles, but must have 4 in the above result.
>
> So for an average of 2 doubles I expected half the files from
> ECADFiles, because one is exactly right and the other
> is on the left.

It's a combinatorial problem. If you have 3 duplicate files, eg:

INSERT INTO paths (path) VALUES
(E'C:\\path\\file1.txt'),
(E'C:\\path2\\file1.txt'),
(E'/path/file1.txt');

then the query process I described above will output the matches:

C:\path\file1.txt | C:\path2\file1.txt
/path/file1.txt | C:\path2\file1.txt
/path/file1.txt | C:\path\file1.txt

because while it avoids showing both (A,B) and (B,A) pairs, for any A, B
and C it'll show:

(A,B)
(A,C)
(B,C)

I've attached test SQL that does the above.

Presumably, you want to only show, say:

(A,B)
(A,C)

or maybe:

(filename, A)
(filename, B)
(filename, C)

If that's what you want, you need to work a little differently. The
attached SQL in dups_test2.sql shows one way you might do it, by
generating a list of files with duplicates then listing all the
locations each appears in. Again, you can do it without the temp table,
it'll probably just be slower. None of what I've written is particularly
fast anyway - it evaluates those regular expressions many more times
than should be required, for example.

--
Craig Ringer

Attachment Content-Type Size
dups_setup.sql text/x-sql 810 bytes
dups_test.sql text/x-sql 619 bytes
dups_test2.sql text/x-sql 266 bytes

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Craig Ringer 2008-11-20 11:09:19 Re: Res: Res: Res: Archive files growth!!!
Previous Message Richard Huxton 2008-11-20 10:36:18 Re: Serial - last value