Re: insert only unique values in to a table, ignore rest?

From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: George Nychis <gnychis(at)cmu(dot)edu>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: insert only unique values in to a table, ignore rest?
Date: 2007-01-08 21:49:16
Message-ID: 1168292956.20602.136.camel@state.g2switchworks.com
Lists: pgsql-general

On Mon, 2007-01-08 at 14:58, George Nychis wrote:
> Hi,
>
> I have approximately 2 billion data entries that I would like to insert into a database.
> Each entry consists of:
> INT BOOLEAN INT BOOLEAN
>
> I want to populate a table such that it only contains the unique rows, all other data
> should be thrown out. I would say a significant amount of the insertions are going to
> fail due to unique constraints. The unique constraint is on the two integers, not on the
> booleans.
>
> Using mysql, I was able to do this with the following query, for all data files (25574
> data files total):
> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
> FIELDS TERMINATED BY ' ';"

A quick question: could you run selects or other inserts on that table
while the LOAD DATA INFILE was running? I'm guessing it basically
locked the whole table while it ran.

> What I *think* mysql did was sort each data file and do a sort of merge sort between the
> data I was inserting and the data in the database. It would insert the first unique
> instance of a row it saw, and reject all other insertions that violated the unique
> constraint due to the "IGNORE".

That's my guess too. That approach would require "one big lock" on the
table, which would mean no parallel access.

It's also likely that it used a temp table, which doubled the size of
the database while you were loading.

> From what I understand, this functionality is not in postgresql. Fine, I certainly can't
> change that. But I am looking for a comparable solution for the size of my data.
>
> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
> distinct entries to another table. This would be like one massive sort?
>
> Is this the only/best solution using postgresql?

TANSTAAFL. PostgreSQL is designed so that you can run an import process
on a table while 100 other users access it at the same time. Because of
that, you don't get to do the dirty, nasty things under the sheets that
allow for super-easy data loading and merging like you got with MySQL.
Apples and oranges.

Assuming you're loading into an empty table, loading everything into a
temp table and then selecting the distinct rows out into the final
table seems reasonable, and should run reasonably fast. If you need to
load into an existing table, it gets a little more complex, as in the
sketch below.
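
For concreteness, here's a rough sketch of both cases using COPY and
PostgreSQL's DISTINCT ON. The table name ipconn and the space delimiter
come from your MySQL command; the column names are made up, since you
didn't post your schema, so adjust to taste:

-- Load everything into a temp table first.  (Server-side COPY needs
-- superuser and a path readable by the backend; from psql you can use
-- \copy instead.)
CREATE TEMP TABLE ipconn_load (
    src      int,      -- hypothetical column names
    src_flag boolean,
    dst      int,
    dst_flag boolean
);

COPY ipconn_load FROM '/path/to/file' DELIMITER ' ';

-- Empty target table: keep one row per (src, dst) pair.  Without an
-- ORDER BY, which duplicate survives is unpredictable.
INSERT INTO ipconn (src, src_flag, dst, dst_flag)
SELECT DISTINCT ON (src, dst) src, src_flag, dst, dst_flag
FROM ipconn_load;

-- Existing target table: also skip pairs that are already present.
INSERT INTO ipconn (src, src_flag, dst, dst_flag)
SELECT DISTINCT ON (l.src, l.dst) l.src, l.src_flag, l.dst, l.dst_flag
FROM ipconn_load l
WHERE NOT EXISTS
      (SELECT 1 FROM ipconn i WHERE i.src = l.src AND i.dst = l.dst);

With 2 billion rows you'd also want to create the unique index on the
target table after the final insert rather than before, since
maintaining it row by row during a load that size will hurt.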
