From: | Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: How to insert a bulk of data with unique-violations very fast |
Date: | 2010-06-06 12:02:20 |
Message-ID: | hug2od$nu4$1@news.eternal-september.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Scott Marlowe schrieb:
>>>> i have a set of unique data which about 150.000.000 rows. Regullary i get
>>>> a
>>>> list of data, which contains multiple times of rows than the already
>>>> stored
>>>> one. Often around 2.000.000.000 rows. Within this rows are many
>>>> duplicates
>>>> and often the set of already stored data.
>>>> I want to store just every entry, which is not within the already stored
>>>> one. Also i do not want to store duplicates. Example:
>>> The standard method in pgsql is to load the data into a temp table
>>> then insert where not exists in old table.
>> Sorry, i didn't get it. I've googled some examples, but no one match at my
>> case. Every example i found was a single insert which should be done or
>> ignored, if the row is already stored.
>>
>> But in my case i have a bulk of rows with duplicates. Either your tipp
>> doesn't match my case or i didn't unterstand it correctly. Can you provide a
>> simple example?
>
> create table main (id int primary key, info text);
> create table loader (id int, info text);
> insert into main values (1,'abc'),(2,'def'),(3,'ghi');
> insert into loader values (1,'abc'),(4,'xyz');
> select * from main;
> id | info
> ----+------
> 1 | abc
> 2 | def
> 3 | ghi
> (3 rows)
>
> select * from loader;
> id | info
> ----+------
> 1 | abc
> 4 | xyz
> (2 rows)
>
> insert into main select * from loader except select * from main;
> select * from main;
> id | info
> ----+------
> 1 | abc
> 2 | def
> 3 | ghi
> 4 | xyz
> (4 rows)
>
> Note that for the where not exists to work the fields would need to be
> all the same, or you'd need a more complex query. If the info field
> here was different you'd get an error an no insert / update. For that
> case you might want to use "where not in":
>
> insert into main select * from loader where id not in (select id from main);
Thank you very much for your example. Now i've got it :)
I've test your example on a small set of my rows. While testing i've
stumpled over a difference in sql-formulation. Using except seems to be
a little slower than the more complex where not in (subquery) group by.
Here is my example:
CREATE TABLE tseq (value text);
INSERT INTO tseq VALUES ('a') , ('b'), ('c');
CREATE UNIQUE INDEX tseq_unique on tseq (value);
CREATE TEMP TABLE tmpseq(value text);
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('d');
INSERT INTO tmpseq VALUES ('d') , ('b'), ('d');
SELECT* from tseq;
value
-------
a
b
c
(3 rows)
SELECT* from tmpseq;
value
-------
a
b
c
a
b
c
a
b
d
d
b
d
(12 rows)
VACUUM VERBOSE ANALYSE;
explain analyze SELECT value FROM tmpseq except SELECT value FROM tseq;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..2.34 rows=4 width=2) (actual
time=0.157..0.158 rows=1 loops=1)
-> Append (cost=0.00..2.30 rows=15 width=2) (actual
time=0.012..0.126 rows=15 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..1.24 rows=12
width=2) (actual time=0.009..0.060 rows=12 loops=1)
-> Seq Scan on tmpseq (cost=0.00..1.12 rows=12
width=2) (actual time=0.004..0.022 rows=12 loops=1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..1.06 rows=3
width=2) (actual time=0.006..0.018 rows=3 loops=1)
-> Seq Scan on tseq (cost=0.00..1.03 rows=3 width=2)
(actual time=0.003..0.009 rows=3 loops=1)
Total runtime: 0.216 ms
(7 rows)
explain analyze SELECT value FROM tmpseq WHERE value NOT IN (SELECT
value FROM tseq) GROUP BY value;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2.20..2.22 rows=2 width=2) (actual
time=0.053..0.055 rows=1 loops=1)
-> Seq Scan on tmpseq (cost=1.04..2.19 rows=6 width=2) (actual
time=0.038..0.043 rows=3 loops=1)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on tseq (cost=0.00..1.03 rows=3 width=2)
(actual time=0.004..0.009 rows=3 loops=1)
Total runtime: 0.105 ms
(6 rows)
My question: is this an generall behavior or just an effect of the small
case?
Greetings form Germany,
Torsten
From | Date | Subject | |
---|---|---|---|
Next Message | Torsten Zühlsdorff | 2010-06-06 12:05:44 | Re: How to insert a bulk of data with unique-violations very fast |
Previous Message | Craig Ringer | 2010-06-06 07:48:19 | Re: How filesystems matter with PostgreSQL |