Re: COPY from .csv File and Remove Duplicates

From: David Johnston <polobo(at)yahoo(dot)com>
To: Rich Shepard <rshepard(at)appl-ecosys(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: COPY from .csv File and Remove Duplicates
Date: 2011-08-11 23:50:19
Message-ID: 6A1FF13C-7526-4296-A132-B4F4988F5E17@yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Aug 11, 2011, at 19:13, Rich Shepard <rshepard(at)appl-ecosys(dot)com> wrote:

> A table has a sequence to generate a primary key for inserted records with
> NULLs in that column.
>
> I have a .csv file of approximately 10k rows to copy into this table. My
> two questions which have not been answered by reference to my postgres
> reference book or Google searches are:
>
> 1) Will the sequence automatically add the nextval() to each new record as
> the copy command runs?
>
> 2) Many of these rows almost certainly are already in the table. I would
> like to remove duplicates either during the COPY command or immediately
> after. I'm considering copying the new data into a clone of the table then
> running a SELECT to add only those rows in the new cloned table to the
> existing table.
>
> Suggestions on getting these data in without duplicating existing rows
> will be very helpful.
>

If you have duplicates with matching real keys inserting into a staging table and then moving new records to the final table is your best option (in general it is better to do a two-step with a staging table since you can readily use Postgresql to perform any intermediate translations) As for the import itself, I believe if the column with the sequence is present in the csv the sequence will not be used and, if no value is present, a null will be stored for that column - causing any not-null constraint to throw an error. In this case I would just import the data to a staging table without any kind of artificial key, just the true key, and then during the merge with the live table you simply omit the pk field from the insert statement and the sequence will kick in at that point.

David J.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2011-08-11 23:54:37 Re: Regex Query Index question
Previous Message David Johnston 2011-08-11 23:35:43 Re: Regex Query Index question