Re: How to insert a bulk of data with unique-violations very fast

From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date: 2010-06-03 20:09:38
Message-ID: AANLkTilj2xHAZMnQ-To3lod3mMPw1DBU_8EkRwTC29qb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thu, Jun 3, 2010 at 11:19 AM, Torsten Zühlsdorff
<foo(at)meisterderspiele(dot)de> wrote:
> Scott Marlowe schrieb:
>>
>> On Tue, Jun 1, 2010 at 9:03 AM, Torsten Zühlsdorff
>> <foo(at)meisterderspiele(dot)de> wrote:
>>>
>>> Hello,
>>>
>>> i have a set of unique data which about 150.000.000 rows. Regullary i get
>>> a
>>> list of data, which contains multiple times of rows than the already
>>> stored
>>> one. Often around 2.000.000.000 rows. Within this rows are many
>>> duplicates
>>> and often the set of already stored data.
>>> I want to store just every entry, which is not within the already stored
>>> one. Also i do not want to store duplicates. Example:
>>
>> The standard method in pgsql is to load the data into a temp table
>> then insert where not exists in old table.
>
> Sorry, i didn't get it. I've googled some examples, but no one match at my
> case. Every example i found was a single insert which should be done or
> ignored, if the row is already stored.
>
> But in my case i have a bulk of rows with duplicates. Either your tipp
> doesn't match my case or i didn't unterstand it correctly. Can you provide a
> simple example?

create table main (id int primary key, info text);
create table loader (id int, info text);
insert into main values (1,'abc'),(2,'def'),(3,'ghi');
insert into loader values (1,'abc'),(4,'xyz');
select * from main;
id | info
----+------
1 | abc
2 | def
3 | ghi
(3 rows)

select * from loader;
id | info
----+------
1 | abc
4 | xyz
(2 rows)

insert into main select * from loader except select * from main;
select * from main;
id | info
----+------
1 | abc
2 | def
3 | ghi
4 | xyz
(4 rows)

Note that for the where not exists to work the fields would need to be
all the same, or you'd need a more complex query. If the info field
here was different you'd get an error an no insert / update. For that
case you might want to use "where not in":

insert into main select * from loader where id not in (select id from main);

If you wanted the new rows to update pre-existing rows, then you could
run an update first where the ids matched, then the insert where no id
matches.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Anj Adu 2010-06-03 20:37:23 Re: slow query performance
Previous Message Kevin Grittner 2010-06-03 20:01:03 Re: Weird XFS WAL problem