Re: insert only unique values in to a table, ignore rest?

From: George Nychis <gnychis(at)cmu(dot)edu>
To: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: insert only unique values in to a table, ignore rest?
Date: 2007-01-08 21:52:53
Message-ID: 45A2BD35.5090404@cmu.edu
Lists: pgsql-general

Scott Marlowe wrote:
> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
>> Hi,
>>
>> I have approximately 2 billion data entries that I would like to insert into a database.
>> Each entry consists of:
>> INT BOOLEAN INT BOOLEAN
>>
>> I want to populate a table such that it only contains the unique rows, all other data
>> should be thrown out. I would say a significant amount of the insertions are going to
>> fail due to unique constraints. The unique constraint is on the two integers, not on the
>> booleans.
>>
>> Using mysql, I was able to do this with the following query, for all data files (25574
>> data files total):
>> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
>> FIELDS TERMINATED BY ' ';"
>
> A quick question. Could you run selects or other inserts on that table
> while the load data infile was running? Cause I'm guessing that it
> basically locked the whole table while running.
What does this have to do with my question? I don't need to run selects or inserts on the
table while the data load is running...

>
>> What I *think* mysql did was sort each data file and do a sort of merge sort between the
>> data I was inserting and the data in the database. It would insert the first unique
>> instance of a row it saw, and reject all other insertions that violated the unique
>> constraint due to the "IGNORE".
>
> Me too. Which would require "one big lock" on the table which would
> mean no parallel access.
That's fine; it doesn't matter.

>
> It's also likely that it used a temp table which doubled the size of the
> database while you were inserting.
>
>> From what I understand, this functionality is not in postgresql. Fine, I certainly can't
>> change that. But I am looking for a comparable solution for the size of my data.
>>
>> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
>> distinct entries to another table. This would be like one massive sort?
>>
>> Is this the only/best solution using postgresql?
>
> TANSTAAFL. PostgreSQL is designed so that you can run an import process
> on that table while 100 other users still access it at the same time.
> Because of that, you don't get to do dirty, nasty things under the
> sheets that allow for super easy data loading and merging like you got
> with MySQL. Apples and Oranges.
>
> Assuming you're loading into an empty table, the load to temp, select
> distinct out and into the final table seems reasonable, should run
> reasonably fast. If you need to load to an existing table, it might get
> a little more complex.
>
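[Editor's note: since parallel access doesn't matter here, another option is to strip duplicates before the data ever reaches the database. A minimal sketch with sort(1), assuming the "INT BOOLEAN INT BOOLEAN" rows are whitespace-separated (the sample file and its contents are made up for illustration); keying on fields 1 and 3 matches the unique constraint on the two integers, though which row's booleans survive among duplicates is not guaranteed across sort implementations:]

```shell
# Hypothetical sample of "INT BOOLEAN INT BOOLEAN" rows.
printf '%s\n' '1 t 2 f' '1 f 2 t' '3 t 4 t' > /tmp/sample.txt

# -u keeps one line per key; -k1,1 -k3,3 restricts the key to the
# first and third fields, i.e. the two integer columns.
sort -u -k1,1 -k3,3 /tmp/sample.txt
```

Each per-file result could then be bulk-loaded, with a final in-database dedup across files.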

The goal is not to run queries while the data is being inserted. I am wondering whether the
PostgreSQL approach I mentioned, inserting everything and then extracting only the distinct
rows, is the most efficient way to produce the same result as the MySQL method I described.
