From: | Scott Mead <scottm(at)openscg(dot)com> |
---|---|
To: | Rich Shepard <rshepard(at)appl-ecosys(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Recommended Protocol: Adding Rows to Table |
Date: | 2011-11-25 20:00:50 |
Message-ID: | CAKq0gv+eCtxN1jP2TDCYj+MDp-OTkdGMkdQiL7pavWJeDvh2Fw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Fri, Nov 25, 2011 at 11:05 AM, Rich Shepard <rshepard(at)appl-ecosys(dot)com>wrote:
> The data originated in a spreadsheet and, based on my experience, contains
> duplicate records. After reformatting there are 143,260 rows to insert in
> the table. The approach I tried seems to have problems (explained below)
> and
> I would like to learn the proper way to insert rows in either an empty
> table
> or one with existing rows since I'll need to do this procedure for my
> projects.
>
> The table was created with the primary key and I used INSERT INTO ... to
> load the data. Many duplicate records, so I split the file into smaller
> ones
> and re-ran the command to load them. I'd then remove the reported duplicate
> rows from the source (text) file. Between runs, I'd issue the DROP TABLE
> command within psql and check it was empty using 'select count(*) from
> waterchem;'.
>
> It appeared that I removed duplicates from the first couple of smaller
> files so I combined them into one file named ok.sql. But, when I tested the
> combined file it, too, reported many duplicate records. Something wrong
> here.
>
> Perhaps a better approach is to put the CREATE TABLE command above the
> INSERT INTO section of the file (without specifying a PK), load that using
> the command 'psql -d <database> -f waterchem.sql', then add the PK and
> remove duplicates as postgres presents them. Or, perhaps there is a much
> more efficient way to accomplish this task.
>
> Rather than my flailing around and spending a lot of time failing to load
> all non-duplicate rows into the table I'd like to learn the proper way to
> accomplish this task. Suggestions, recommendations, and your experiences
> are
> requested.
>
Why don't you first load the data into a table (no primary key), then use
SQL to find your dups?
once loaded:
SELECT <primary_key_column>, count(1) from <table> group by 1 having
count(1) > 1;
At least then, you'll really know what you're in for. You can either
script a DELETE or... whatever you want to do, once clean, you can add the
PK.
--
Scott
>
> TIA,
>
> Rich
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/**mailpref/pgsql-general<http://www.postgresql.org/mailpref/pgsql-general>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Alpha Beta | 2011-11-25 20:04:38 | .dmp files in postgresql |
Previous Message | Adrian Klaver | 2011-11-25 19:17:15 | Re: Recommended Protocol: Adding Rows to Table |