Re: Most efficient way to insert without duplicates

From: François Beausoleil <francois(at)teksol(dot)info>
To: François Beausoleil <francois(at)teksol(dot)info>
Cc: Forums postgresql <pgsql-general(at)postgresql(dot)org>
Subject: Re: Most efficient way to insert without duplicates
Date: 2013-04-17 11:26:27
Message-ID: 82853183-91B9-4301-B4A9-603D8047FBBE@teksol.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


Le 2013-04-16 à 22:51, François Beausoleil a écrit :

> Hi all!
>
> I track Twitter followers in my database. I have the following table:
>
> # \d persona_followers
> Table "public.persona_followers"
> Column | Type | Modifiers
> -------------+-----------------------------+-----------
> service_id | bigint | not null
> follower_id | bigint | not null
> valid_at | timestamp without time zone |
> Indexes:
> "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)
>
> The table IS NOT partitioned.

A run with the following query:

INSERT INTO "persona_followers"
SELECT "service_id", "follower_id", now()
FROM (
SELECT *
FROM (
SELECT DISTINCT "service_id", "follower_id"
FROM "persona_followers_import"
WHERE "service_id" IS NOT NULL OR "follower_id" IS NOT NULL
EXCEPT
SELECT "service_id", "follower_id" FROM "persona_followers") AS "t1") AS "t1"

results in http://explain.depesz.com/s/Y1c

Insert on public.persona_followers (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
Buffers: shared hit=33135295 read=4776921
-> Subquery Scan on t1 (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)
Output: t1.service_id, t1.follower_id, now()
Buffers: shared hit=36891 read=3572263
-> HashSetOp Except (cost=139261.12..20389650.18 rows=6256498 width=16) (actual time=562265.127..566513.759 rows=6819520 loops=1)
Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, (0)
Buffers: shared hit=36891 read=3572263
-> Append (cost=139261.12..17054024.97 rows=667125042 width=16) (actual time=4090.462..320879.545 rows=667689321 loops=1)
Buffers: shared hit=36891 read=3572263
-> Subquery Scan on "*SELECT* 1" (cost=139261.12..264391.09 rows=6256498 width=16) (actual time=4090.461..7798.334 rows=6820784 loops=1)
Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, 0
Buffers: shared hit=36891
-> HashAggregate (cost=139261.12..201826.11 rows=6256498 width=16) (actual time=4090.459..6795.009 rows=6820784 loops=1)
Output: persona_followers_import.service_id, persona_followers_import.follower_id
Buffers: shared hit=36891
-> Seq Scan on francois.persona_followers_import (cost=0.00..105137.75 rows=6824675 width=16) (actual time=0.017..1344.916 rows=6824700 loops=1)
Output: persona_followers_import.service_id, persona_followers_import.follower_id
Filter: ((persona_followers_import.service_id IS NOT NULL) OR (persona_followers_import.follower_id IS NOT NULL))
Buffers: shared hit=36891
-> Subquery Scan on "*SELECT* 2" (cost=0.00..16789633.88 rows=660868544 width=16) (actual time=6.694..238761.499 rows=660868537 loops=1)
Output: "*SELECT* 2".service_id, "*SELECT* 2".follower_id, 1
Buffers: shared read=3572263
-> Seq Scan on public.persona_followers (cost=0.00..10180948.44 rows=660868544 width=16) (actual time=6.693..137929.808 rows=660868537 loops=1)
Output: public.persona_followers.service_id, public.persona_followers.follower_id
Buffers: shared read=3572263
Total runtime: 4729338.157 ms

1h20m for 6.8 million rows inserted. The estimates are spot on, since I had just run VACUUM ANALYZE on both tables prior to doing this test. Running the original query now, SELECT * FROM ... WHERE NOT EXISTS(...).

Bye,
François

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Aleksey Tsalolikhin 2013-04-17 13:23:41 How large can a PostgreSQL database get?
Previous Message Albe Laurenz 2013-04-17 07:18:13 Re: dataset lock