Re: Using database to find file doublettes in my computer

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Lothar Behrens <lothar(dot)behrens(at)lollisoft(dot)de>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Using database to find file doublettes in my computer
Date: 2008-11-18 06:40:23
Message-ID: 49226357.9000407@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Lothar Behrens wrote:

> But how to query for files to display a 'left / right view' for each
> file that is on multible places ?

One approach is to use a query to extract the names of all files with
duplicates, and store the results in a TEMPORARY table with a UNIQUE
index (or PRIMARY KEY) on the filename.

You then self-inner-join your paths table to its self, with the join
condition being that the filename part of the path EXISTS in the table
of files having duplicates. You should also filter out rows where the
first filename is <= the second filename to avoid outputting (a,b) and
(b,a) for each duplicate, and to avoid outputting (a,a) and (b,b) rows.

You can do much the same thing in a single query without the temp table,
but I think you'll need a correlated subquery to check for duplicates
(or a join on a subquery that'll really expand the results to be
processed), so the temp table approach is probably going to be lots faster.

Quick example code (untested but should give you the idea), assumes you
have a function get_filename(...) that extracts just the filename part
of the path:

CREATE TABLE paths (
path PRIMARY KEY,
--- other fields
);

-- Populate paths

-- Once paths is populated, extract duplicates:

SELECT get_filename(path) AS fn, count(path) AS n
FROM paths HAVING count(path) > 1
INTO TEMPORARY TABLE dup_files;

-- Creates UNIQUE index on PATH as well
ALTER TABLE dup_files ADD CONSTRAINT PRIMARY KEY (path);

-- Now build your side-by-side table of duplicates:

SELECT p1.path, p2.path
FROM paths p1 INNER JOIN paths p2
ON (get_filename(p1.path) = get_filename(p2.path))
WHERE EXISTS(SELECT 1 FROM dup_files WHERE fn = get_filename(p1.path))
AND p1.path > p2.path;

You can replace get_filename(fn) with appropriate code, but I'd write a
quick SQL function marked IMMUTABLE to neatly wrap up the pathname
extraction instead.

--
Craig Ringer

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Craig Ringer 2008-11-18 06:42:33 Re: How to reduce impact of a query.
Previous Message Scott Marlowe 2008-11-18 06:26:51 Re: How to reduce impact of a query.