From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: George Nychis <gnychis(at)cmu(dot)edu>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: insert only unique values in to a table, ignore rest?
Date: 2007-01-08 21:57:23
Message-ID: 1168293442.20602.141.camel@state.g2switchworks.com
Lists: pgsql-general
On Mon, 2007-01-08 at 15:52, George Nychis wrote:
> Scott Marlowe wrote:
> > On Mon, 2007-01-08 at 14:58, George Nychis wrote:
> >> Hi,
> >>
> >> I have approximately 2 billion data entries that I would like to insert into a database.
> >> Each entry consists of:
> >> INT BOOLEAN INT BOOLEAN
> >>
> >> I want to populate a table such that it only contains the unique rows, all other data
> >> should be thrown out. I would say a significant amount of the insertions are going to
> >> fail due to unique constraints. The unique constraint is on the two integers, not on the
> >> booleans.
> >>
> >> Using mysql, I was able to do this with the following query, for all data files (25574
> >> data files total):
> >> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
> >> FIELDS TERMINATED BY ' ';"
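
[For reference, the table George describes might look roughly like this
in PostgreSQL. The table name ipconn comes from the command above; the
column names here are assumptions for illustration only.]

    -- Hypothetical schema for the INT BOOLEAN INT BOOLEAN rows;
    -- the unique constraint covers only the two integer columns.
    CREATE TABLE ipconn (
        src_ip   integer NOT NULL,
        src_flag boolean NOT NULL,
        dst_ip   integer NOT NULL,
        dst_flag boolean NOT NULL,
        UNIQUE (src_ip, dst_ip)
    );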
> >
> > A quick question. Could you run selects or other inserts on that table
> > while the load data infile was running? Cause I'm guessing that it
> > basically locked the whole table while running.
> What does this have to do with my question? I don't need to run selects or inserts on the
> table while the load data is running...
>
> >
> >> What I *think* mysql did was sort each data file and do a sort of merge sort between the
> >> data I was inserting and the data in the database. It would insert the first unique
> >> instance of a row it saw, and reject all other insertions that violated the unique
> >> constraint due to the "IGNORE".
> >
> > Me too. Which would require "one big lock" on the table which would
> > mean no parallel access.
> That's fine; it doesn't matter.
>
> >
> > It's also likely that it used a temp table which doubled the size of the
> > database while you were inserting.
> >
> >> From what I understand, this functionality is not in postgresql. Fine, I certainly can't
> >> change that. But I am looking for a comparable solution for the size of my data.
> >>
> >> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
> >> distinct entries to another table. This would be like one massive sort?
> >>
> >> Is this the only/best solution using postgresql?
> >
> > TANSTAAFL. PostgreSQL is designed so that you can run an import process
> > on that table while 100 other users still access it at the same time.
> > Because of that, you don't get to do dirty, nasty things under the
> > sheets that allow for super easy data loading and merging like you got
> > with MySQL. Apples and Oranges.
> >
> > Assuming you're loading into an empty table, the load to temp, select
> > distinct out and into the final table seems reasonable, should run
> > reasonably fast. If you need to load to an existing table, it might get
> > a little more complex.
> >
>
> The goal is not to run queries while the data is being inserted... I am wondering
> whether the PostgreSQL method I mentioned, inserting everything and keeping only the
> distinct values, is the most efficient way to get the same results as the method I
> described for MySQL.
Did I fail to answer your question?
Sorry if I gave you more information than you needed. Please feel free
to ask someone else next time.
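
[The temp-table route Scott recommends above might look like the sketch
below. Table and column names are the assumptions from the earlier
example; note that COPY FROM a server-side file requires superuser
privileges, so loading from a client would use psql's \copy instead.]

    -- Stage the raw rows with no constraints; LIKE copies only the
    -- column definitions, not the unique constraint.
    CREATE TEMP TABLE ipconn_load (LIKE ipconn);

    -- The data files are space-separated, so override COPY's default
    -- tab delimiter. Repeat for each of the 25574 files.
    COPY ipconn_load FROM '/path/to/file' WITH DELIMITER ' ';

    -- Empty target table: keep one arbitrary row per (src_ip, dst_ip)
    -- pair, much like MySQL's IGNORE keeps the first row it sees.
    INSERT INTO ipconn
    SELECT DISTINCT ON (src_ip, dst_ip) *
    FROM ipconn_load;

    -- Existing target table (the "more complex" case): additionally
    -- skip pairs that are already present.
    -- INSERT INTO ipconn
    -- SELECT DISTINCT ON (l.src_ip, l.dst_ip) l.*
    -- FROM ipconn_load l
    -- WHERE NOT EXISTS (SELECT 1 FROM ipconn i
    --                   WHERE i.src_ip = l.src_ip
    --                     AND i.dst_ip = l.dst_ip);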