Re: How to insert a bulk of data with unique-violations very fast

From: Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date: 2010-06-06 12:02:20
Message-ID: hug2od$nu4$1@news.eternal-september.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Scott Marlowe schrieb:

>>>> i have a set of unique data which about 150.000.000 rows. Regullary i get
>>>> a
>>>> list of data, which contains multiple times of rows than the already
>>>> stored
>>>> one. Often around 2.000.000.000 rows. Within this rows are many
>>>> duplicates
>>>> and often the set of already stored data.
>>>> I want to store just every entry, which is not within the already stored
>>>> one. Also i do not want to store duplicates. Example:
>>> The standard method in pgsql is to load the data into a temp table
>>> then insert where not exists in old table.
>> Sorry, i didn't get it. I've googled some examples, but no one match at my
>> case. Every example i found was a single insert which should be done or
>> ignored, if the row is already stored.
>>
>> But in my case i have a bulk of rows with duplicates. Either your tipp
>> doesn't match my case or i didn't unterstand it correctly. Can you provide a
>> simple example?
>
> create table main (id int primary key, info text);
> create table loader (id int, info text);
> insert into main values (1,'abc'),(2,'def'),(3,'ghi');
> insert into loader values (1,'abc'),(4,'xyz');
> select * from main;
> id | info
> ----+------
> 1 | abc
> 2 | def
> 3 | ghi
> (3 rows)
>
> select * from loader;
> id | info
> ----+------
> 1 | abc
> 4 | xyz
> (2 rows)
>
> insert into main select * from loader except select * from main;
> select * from main;
> id | info
> ----+------
> 1 | abc
> 2 | def
> 3 | ghi
> 4 | xyz
> (4 rows)
>
> Note that for the where not exists to work the fields would need to be
> all the same, or you'd need a more complex query. If the info field
> here was different you'd get an error an no insert / update. For that
> case you might want to use "where not in":
>
> insert into main select * from loader where id not in (select id from main);

Thank you very much for your example. Now i've got it :)

I've test your example on a small set of my rows. While testing i've
stumpled over a difference in sql-formulation. Using except seems to be
a little slower than the more complex where not in (subquery) group by.
Here is my example:

CREATE TABLE tseq (value text);
INSERT INTO tseq VALUES ('a') , ('b'), ('c');
CREATE UNIQUE INDEX tseq_unique on tseq (value);
CREATE TEMP TABLE tmpseq(value text);
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('d');
INSERT INTO tmpseq VALUES ('d') , ('b'), ('d');
SELECT* from tseq;
value
-------
a
b
c
(3 rows)

SELECT* from tmpseq;
value
-------
a
b
c
a
b
c
a
b
d
d
b
d
(12 rows)

VACUUM VERBOSE ANALYSE;

explain analyze SELECT value FROM tmpseq except SELECT value FROM tseq;
QUERY PLAN

----------------------------------------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..2.34 rows=4 width=2) (actual
time=0.157..0.158 rows=1 loops=1)
-> Append (cost=0.00..2.30 rows=15 width=2) (actual
time=0.012..0.126 rows=15 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..1.24 rows=12
width=2) (actual time=0.009..0.060 rows=12 loops=1)
-> Seq Scan on tmpseq (cost=0.00..1.12 rows=12
width=2) (actual time=0.004..0.022 rows=12 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..1.06 rows=3
width=2) (actual time=0.006..0.018 rows=3 loops=1)
-> Seq Scan on tseq (cost=0.00..1.03 rows=3 width=2)
(actual time=0.003..0.009 rows=3 loops=1)
Total runtime: 0.216 ms
(7 rows)

explain analyze SELECT value FROM tmpseq WHERE value NOT IN (SELECT
value FROM tseq) GROUP BY value;
QUERY PLAN

------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2.20..2.22 rows=2 width=2) (actual
time=0.053..0.055 rows=1 loops=1)
-> Seq Scan on tmpseq (cost=1.04..2.19 rows=6 width=2) (actual
time=0.038..0.043 rows=3 loops=1)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on tseq (cost=0.00..1.03 rows=3 width=2)
(actual time=0.004..0.009 rows=3 loops=1)
Total runtime: 0.105 ms
(6 rows)

My question: is this an generall behavior or just an effect of the small
case?

Greetings form Germany,
Torsten

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Torsten Zühlsdorff 2010-06-06 12:05:44 Re: How to insert a bulk of data with unique-violations very fast
Previous Message Craig Ringer 2010-06-06 07:48:19 Re: How filesystems matter with PostgreSQL