Re: Import large data set into a table and resolve duplicates?

From: John McKown <john(dot)archie(dot)mckown(at)gmail(dot)com>
To: Eugene Dzhurinsky <jdevelop(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Import large data set into a table and resolve duplicates?
Date: 2015-02-15 16:00:50
Message-ID: CAAJSdjjf6+YyLaOdpGdAzjwN9EurQ9SavKTTGoSTsk_kVsddog@mail.gmail.com
Lists: pgsql-general

On Sat, Feb 14, 2015 at 3:54 PM, Eugene Dzhurinsky <jdevelop(at)gmail(dot)com>
wrote:

> Hello!
>
> I have a huge dictionary table with series data generated by a third-party
> service. The table consists of 2 columns
>
> - id : serial, primary key
> - series : varchar, not null, indexed
>
> From time to time I need to apply a "patch" to the dictionary, the patch
> file
> consists of "series" data, one per line.
>
> Now I need to import the patch into the database, and produce another file
> as
> - if the passed "series" field exists in the database, then return
> ID:series
> - otherwise insert a new row to the table and generate new ID and return
> ID:series
> for each row in the source file.
>
> So the new file will contain both ID and series data, separated by tab or
> something.
>
> While reading and writing the data is not a question (I simply described
> the
> whole task just in case), I wonder what is the most efficient way of
> importing
> such data into a table, keeping in mind that
>
> - the dictionary table already consists of ~200K records
> - the patch could be ~1-50K of records long
> - records are never removed from the dictionary, only added if they don't exist
>
> Thanks!
>
> --
> Eugene Dzhurinsky
>

I was hoping that you'd get a good reply from someone else. But it is the
weekend. So I will _try_ to explain what I would try. You can see if it is
of any use to you. Sometimes my ideas are really good. And other times they
are, well let's be nice and say "not as good as other times" [grin/]. I
can't test the below because I don't have any data. But hopefully it will
be close and even of some help to you.

The first thing that I would do is put the input patch data into its own
table. Perhaps a temporary table, or even a permanent one.

DROP TABLE IF EXISTS patch_data;
CREATE TABLE patch_data (
    input_record_number SERIAL,
    already_exists BOOLEAN DEFAULT FALSE,
    id INTEGER,
    series TEXT NOT NULL );
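
If you go the temporary-table route mentioned above, only the first line
changes, and the table quietly disappears at the end of the session (so the
DROP at the end becomes unnecessary):

CREATE TEMPORARY TABLE patch_data (
    input_record_number SERIAL,
    already_exists BOOLEAN DEFAULT FALSE,
    id INTEGER,
    series TEXT NOT NULL );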

I don't know if the order of the records in the output file needs to match
the order of the records in the input file. If not, then you don't need the
"input_record_number" column. Otherwise, it is used to maintain the order
of the input. At this point, you can load the input file with an SQL
command similar to:

COPY patch_data (series) FROM 'input-file.txt';
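
(Note that COPY ... FROM reads a file on the database *server*, and wants
an absolute path; the file name above is just a stand-in for wherever your
file really is. If the patch file lives on your client machine instead,
psql's \copy meta-command does the same thing client-side:)

\copy patch_data (series) from 'input-file.txt'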

Now update patch_data from the existing dictionary table to see which
"series" values already exist.

UPDATE patch_data SET already_exists = EXISTS (SELECT 1 FROM dictionary
WHERE dictionary.series = patch_data.series);

At this point, the table patch_data has been updated such that if the
series data in it already exists in "dictionary", the "already_exists"
column is now TRUE instead of the initial FALSE. (EXISTS is used above
rather than a scalar subquery so that rows with no match get FALSE, not
NULL, which would otherwise slip past the FALSE test below.) This means
that we need to insert all the series data in "patch_data" which does not
exist in "dictionary" (i.e. "already_exists" is FALSE in "patch_data")
into "dictionary".

INSERT INTO dictionary(series) SELECT series FROM patch_data WHERE
already_exists = FALSE;
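
One thing to watch for (an assumption on my part, since I haven't seen
your patch files): if the same series value can occur twice within one
patch file, the INSERT above will happily insert it twice, because nothing
in the patch_data definition makes series unique. A DISTINCT guards
against that:

INSERT INTO dictionary(series) SELECT DISTINCT series FROM patch_data
WHERE already_exists = FALSE;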

The above should insert the "series" records which don't exist in
"dictionary" and generate an "id" value for each of them from your SERIAL
definition. Now we need to find out the "id" value for each of the
"series" values and return them.

UPDATE patch_data SET id = (SELECT dictionary.id FROM dictionary WHERE
dictionary.series = patch_data.series);

At this point, if I have not messed up, every "series" value in
"patch_data" should have the proper "id" value from "dictionary", so the
table "patch_data" now has all that we need in it to make the output file.
I don't know your output requirements, but I'd just do something like:

SELECT id || E'\t' || series FROM patch_data ORDER BY input_record_number;
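
Or, if you'd rather have PostgreSQL write the file itself, COPY works in
the other direction too; its default text format already separates columns
with a tab (same server-side path caveat as before, and the file name is
again a placeholder):

COPY (SELECT id, series FROM patch_data ORDER BY input_record_number)
TO 'output-file.txt';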

Again, if the output order does not need to be the same as the input order,
then all the stuff with the "input_record_number" column can be eliminated.
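
One more thought: wrapping the whole sequence in a transaction means a
failure partway through leaves "dictionary" untouched rather than
half-patched. Roughly:

BEGIN;
UPDATE patch_data SET already_exists = EXISTS (SELECT 1 FROM dictionary
    WHERE dictionary.series = patch_data.series);
INSERT INTO dictionary(series) SELECT DISTINCT series FROM patch_data
    WHERE already_exists = FALSE;
UPDATE patch_data SET id = (SELECT dictionary.id FROM dictionary WHERE
    dictionary.series = patch_data.series);
COMMIT;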
Just to be a good citizen:

DROP TABLE patch_data;

at this point. Unless you think you might need it for some reason. Who
knows, you might even want to archive it as some sort of audit information.

--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown
