Re: Import large data set into a table and resolve duplicates?

From: John McKown <john(dot)archie(dot)mckown(at)gmail(dot)com>
To: Eugene Dzhurinsky <jdevelop(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Import large data set into a table and resolve duplicates?
Date: 2015-02-15 16:04:22
Message-ID: CAAJSdjgksStGtVY5YjaHh=TLhC3=MaSpox+Ptxaewi3hb5eE+g@mail.gmail.com
Lists: pgsql-general

OOPS, I forgot to mention that in the SELECT generating the output file, the
e'\t' generates a "tab" character. You likely already know this, but I like
to make my posts as self-contained as possible.
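For example, just to illustrate the escape syntax:

SELECT 'id' || e'\t' || 'series';  -- the two literals joined by a tab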

On Sun, Feb 15, 2015 at 10:00 AM, John McKown <john(dot)archie(dot)mckown(at)gmail(dot)com>
wrote:

> On Sat, Feb 14, 2015 at 3:54 PM, Eugene Dzhurinsky <jdevelop(at)gmail(dot)com>
> wrote:
>
>> Hello!
>>
>> I have a huge dictionary table with series data generated by a third-party
>> service. The table consists of two columns:
>>
>> - id : serial, primary key
>> - series : varchar, not null, indexed
>>
>> From time to time I need to apply a "patch" to the dictionary; the patch
>> file consists of "series" values, one per line.
>>
>> Now I need to import the patch into the database and produce another
>> file such that, for each row in the source file:
>> - if the given "series" value already exists in the database, return
>> ID:series
>> - otherwise, insert a new row (generating a new ID) and return
>> ID:series
>>
>> So the new file will contain both ID and series data, separated by tab or
>> something.
>>
>> While reading and writing the data is not the question (I simply
>> described the whole task just in case), I wonder about the most
>> efficient way of importing such data into the table, keeping in
>> mind that:
>>
>> - the dictionary table already contains ~200K records
>> - a patch could be ~1-50K records long
>> - records are never removed from the dictionary, only added when they
>> do not already exist
>>
>> Thanks!
>>
>> --
>> Eugene Dzhurinsky
>>
>
> I was hoping that you'd get a good reply from someone else. But it is the
> weekend. So I will _try_ to explain what I would do. You can see if it is
> of any use to you. Sometimes my ideas are really good. And other times they
> are, well, let's be nice and say "not as good as other times" [grin/]. I
> can't test the below because I don't have any data. But hopefully it will
> be close and even of some help to you.
>
> The first thing that I would do is put the input patch data into its own
> table. Perhaps a temporary table, or even a permanent one.
>
> DROP TABLE IF EXISTS patch_data;
> CREATE TABLE patch_data (
>     input_record_number SERIAL,
>     already_exists BOOLEAN DEFAULT FALSE,
>     id INTEGER,
>     series TEXT NOT NULL );
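>
> If you'd rather have the temporary table mentioned above, only the first
> line changes, and the table cleans itself up at the end of the session
> (making the DROP at the bottom of this mail optional):
>
> CREATE TEMPORARY TABLE patch_data (
>     input_record_number SERIAL,
>     already_exists BOOLEAN DEFAULT FALSE,
>     id INTEGER,
>     series TEXT NOT NULL );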
>
> I don't know if the order of the records in the output file needs to
> match the order of the records in the input file. If not, then you don't
> need the "input_record_number" field. Otherwise, that is used to maintain
> the order of the input. At this point, you can load the input file with an
> SQL command similar to:
>
> COPY patch_data (series) FROM 'input-file.txt';
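>
> A side note: COPY ... FROM reads the file on the *server*, so the path has
> to be one the backend can see. If the file lives on your client machine,
> psql's \copy does the same thing from the client side:
>
> \copy patch_data (series) FROM 'input-file.txt'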
>
> Now update patch_data from the existing dictionary table to see which
> "series" values already exist. An EXISTS test yields FALSE rather than
> NULL when there is no match, which the WHERE clause further down relies on.
>
> UPDATE patch_data SET already_exists = EXISTS (SELECT 1 FROM dictionary
> WHERE dictionary.series = patch_data.series);
>
> At this point, the table patch_data has been updated such that if the
> series data in it already exists, the "already_exists" column is now TRUE
> instead of the initial FALSE. This means that we need to insert all the
> series data in "patch_data" which does not exist in "dictionary" ( i.e.
> "already_exists" is FALSE in "patch_data") into "dictionary".
>
> INSERT INTO dictionary(series) SELECT series FROM patch_data WHERE
> already_exists = FALSE;
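>
> One caveat: if the patch file can contain the same "series" value twice,
> the INSERT above would add it to "dictionary" twice as well. A DISTINCT
> guards against that (untested, like everything else here):
>
> INSERT INTO dictionary(series) SELECT DISTINCT series FROM patch_data
> WHERE already_exists = FALSE;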
>
> The above should insert the "series" records which don't exist in
> "dictionary" and generate an "id" value for each of them from your SERIAL
> definition. Now we need to find the "id" value for each "series" value
> and return them.
>
> UPDATE patch_data SET id = (SELECT id FROM dictionary WHERE
> dictionary.series = patch_data.series);
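>
> An equivalent form, if you prefer a join over a per-row subquery, is
> UPDATE ... FROM (again, untested):
>
> UPDATE patch_data SET id = d.id
> FROM dictionary d WHERE d.series = patch_data.series;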
>
> At this point, if I have not messed up, every "series" value in
> "patch_data" should have the proper "id" value from "dictionary". So the
> table "patch_data" should now have all that we need in it. So we just use
> it to make our output file. I don't know your output requirements, but
> I'd just do something like:
>
> SELECT id || e'\t' || series FROM patch_data ORDER BY input_record_number;
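>
> To land that in a file rather than on your screen, \copy takes a query as
> well, and the default text format already uses a tab as the column
> delimiter, so the concatenation can even be dropped (pick your own path):
>
> \copy (SELECT id, series FROM patch_data ORDER BY input_record_number) TO 'output.txt'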
>
> Again, if the output order does not need to be the same as the input
> order, then all the stuff with the "input_record_number" column can be
> eliminated. Just to be a good citizen:
>
> DROP TABLE patch_data;
>
> at this point. Unless you think you might need it for some reason. Who
> knows, you might even want to archive it as some sort of audit information.
>
>
> --
> He's about as useful as a wax frying pan.
>
> 10 to the 12th power microphones = 1 Megaphone
>
> Maranatha! <><
> John McKown
>

--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown
