Re: COPY from .csv File and Remove Duplicates

From: "David Johnston" <polobo(at)yahoo(dot)com>
To: "'Rich Shepard'" <rshepard(at)appl-ecosys(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY from .csv File and Remove Duplicates
Date: 2011-08-12 02:32:37
Message-ID: 014b01cc5898$1a4981c0$4edc8540$@yahoo.com
Lists: pgsql-general

>> There is no true key, only an artificial key so I can ensure that rows are
>> unique. That's in the main table with the 50K rows. No key column in the
>> .csv file.

If you have no true key, then you have no way to ensure uniqueness. By
adding an artificial key, two records that are otherwise duplicates would
now be considered unique.
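
As a quick sanity check (the column names below are invented for
illustration; substitute whatever your non-key columns actually are), a
grouped count will show whether such otherwise-identical rows already
exist:

SELECT col_a, col_b, col_c, count(*)
FROM maintable
GROUP BY col_a, col_b, col_c
HAVING count(*) > 1;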

Since you have not given data/structure for either the CSV or the main
table, more specific help is not possible, but when using a serial PK, in
almost every case the table should also have a candidate key with a UNIQUE
index defined. If you cannot explain why yours does not, and why it cannot,
I would offer that you need to gain further understanding of your data
model. It is generally wise to create a UNIQUE index on a candidate key and
risk being wrong. At least you will be given an actual error and, in the
worst case, can always drop the UNIQUE index if the "duplicate" record does
indeed turn out to be valid; though in that situation you now have more
data to feed into your model analysis and should be able to modify the
table to create a new candidate key.
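
As an illustration only (again, invented column names since I have not seen
your structure), declaring such an index, and retracting it later if need
be, is cheap:

-- assume (col_a, col_b) is the suspected candidate key
CREATE UNIQUE INDEX maintable_candidate_key ON maintable (col_a, col_b);

-- if a legitimate "duplicate" ever turns up, the index can simply be dropped
DROP INDEX maintable_candidate_key;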

Slight tangent, but I have an external accounting source where I know that,
with respect to the available data, duplicates can occur (a PK field is not
available). Since I have no candidate key I am forced to use an artificial
(serial) key and take extra precautions to ensure I do not inadvertently
introduce duplicate data during import. In my case I handle data at the
"day" level. My source gives me every transaction for a given date and I
then modify my live table to add only the correct number of records so
that, after the merge process, I have an exact duplicate of the data in the
source file. Thus, since I trust the source file (and cannot enter data via
any other method), I know immediately after processing that any duplicates
on a given date are expected duplicates as opposed to, say, the result of
accidentally importing the same file twice and thus having twice as many
records. I also understand that if, say for reconciliation purposes, I need
to choose one of a set of duplicate records, it does not matter, initially,
which one I choose; and as long as I only add records afterwards, I can
ensure that I always pick the same record in the future. However, if I am
forced to DELETE a record, from a practical perspective I DELETE BOTH/ALL
of the duplicates and then ADD back the correct number of records for that
date. Anything that cared about the original records will then need to
decide how to handle the fact that its record may no longer be present
(which beats deleting only some of the existing records at random without
knowing which ones are the "correct" ones to delete).

This is one example I've come across where the data I am working with has
absolutely NO inherent PK that I can see but where I can trust that, for a
given dataset, I only have valid data. I did have to assign a SERIAL PK to
my copy of the data but I also recognized where problems could occur and
mitigated them via specific processing routines. One alternative solution
would be to simply DELETE everything for a given date and then import every
record from the source file into the main table. I rejected that solution
because I could not afford to continually delete the existing records:
other tables have FK relationships to them, and continually breaking those
links (via ON DELETE SET NULL) was unacceptable. I still have to do so when
I need to delete a record (rare, given this is accounting data), but simply
adding a new record does not affect existing records.
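
For comparison, the rejected delete-and-reload approach (same assumed
names; the date literal is just a placeholder) boils down to:

-- wipe the date and reload it wholesale; every FK declared with
-- ON DELETE SET NULL loses its link in the process
DELETE FROM live_table WHERE txn_date = '2011-08-11';

INSERT INTO live_table (txn_date, account, amount)
SELECT txn_date, account, amount
FROM daily_staging;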

Whether this situation mirrors yours I do not know but I hope this brief
description is at least informative and educational for you and others.
Feedback/Thoughts are greatly welcomed.

>> I presume what you call a staging table is what I refer to as a copy of
>> the main table, but with no key attribute.

>> Writing the SELECT statement to delete from the staging table those rows
>> that already exist in the main table is where I'm open to suggestions.

The big question to ask is how you would identify a record in the CSV file
as already being in the main table (either directly or, as in my example
above, indirectly).

My use of "staging table" reflects the fact that the structure of the table
should roughly match the CSV file and NOT the "main table". The SQL you
issue to move records from the staging table to the main table will then
account for any differences between the two.
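
In concrete terms that might look like the following (a sketch only; the
column names and file path are placeholders for whatever your CSV actually
contains):

-- staging mirrors the CSV layout: no serial id column, no constraints
CREATE TABLE staging (
    col_a  text,
    col_b  text,
    col_c  text
);

COPY staging FROM '/path/to/file.csv' CSV HEADER;
-- (or psql's \copy if the file lives on the client side)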

The general idea is to load up the staging table, optionally update
"matching" records on the main table, insert non-matching records, then
truncate/clear the staging table.

The general structure for the insert would be:

INSERT INTO maintable (cols)
SELECT cols FROM staging
WHERE staging.idcols NOT IN (SELECT maintable.idcols FROM maintable);

There may be more efficient ways to write the query but the idea is the
same.
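
Fleshing that out with the same sort of placeholder columns (col_a/col_b
standing in for the "idcols", col_c for everything else), the whole cycle
might look like the sketch below; NOT EXISTS is often preferable to NOT IN,
particularly if the id columns can contain NULLs:

-- optionally refresh rows that already exist on the main table
UPDATE maintable m
   SET col_c = s.col_c
  FROM staging s
 WHERE m.col_a = s.col_a
   AND m.col_b = s.col_b;

-- insert the rows that are not there yet
INSERT INTO maintable (col_a, col_b, col_c)
SELECT s.col_a, s.col_b, s.col_c
FROM staging s
WHERE NOT EXISTS (
    SELECT 1
    FROM maintable m
    WHERE m.col_a = s.col_a
      AND m.col_b = s.col_b
);

-- clear the staging table for the next load
TRUNCATE staging;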

David J.
