Re: Most efficient way to insert without duplicates

From: François Beausoleil <francois(at)teksol(dot)info>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Forums postgresql <pgsql-general(at)postgresql(dot)org>
Subject: Re: Most efficient way to insert without duplicates
Date: 2013-04-24 16:04:38
Message-ID: A35148D6-07F4-4E11-84FD-7FDC0F19C011@teksol.info
Lists: pgsql-general

Hi!

On 2013-04-17, at 14:15, Jeff Janes wrote:

> On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <francois(at)teksol(dot)info> wrote:
>
>
> Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
>   Buffers: shared hit=33135295 read=4776921
>   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)
>
>
> It looks like 12% of the time is being spent figuring out what rows to insert, and 88% actually doing the insertions.
>
> So I think that index maintenance is killing you. You could try adding a sort to your select so that rows are inserted in index order, or inserting in batches in which the batches are partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column).

To close out the thread, the final results are in http://explain.depesz.com/s/xOe :

Insert on public.persona_followers  (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1)
  Buffers: shared hit=206, local hit=1 read=105
  ->  Sort  (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1)
        Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at))
        Sort Key: persona_followers_import.follower_id
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=176, local hit=1 read=105
        ->  HashAggregate  (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1)
              Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at)
              Buffers: shared hit=176, local hit=1 read=105
              ->  Bitmap Heap Scan on pg_temp_35.persona_followers_import  (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1)
                    Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at
                    Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                    Filter: (NOT (hashed SubPlan 1))
                    Buffers: shared hit=176, local hit=1 read=105
                    ->  Bitmap Index Scan on persona_followers_import_service_id  (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1)
                          Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                          Buffers: local hit=1 read=40
                    SubPlan 1
                      ->  Bitmap Heap Scan on public.persona_followers  (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1)
                            Output: public.persona_followers.follower_id
                            Recheck Cond: (public.persona_followers.service_id = 362513855)
                            Buffers: shared hit=176
                            ->  Bitmap Index Scan on persona_followers_pkey  (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1)
                                  Index Cond: (public.persona_followers.service_id = 362513855)
                                  Buffers: shared hit=88
Total runtime: 19.917 ms

Runtime is under 20 milliseconds per imported service_id. I have a few thousand such items per day, and that's fine. The final script looks like this:

CREATE TEMPORARY TABLE persona_followers_import( service_id bigint, follower_id bigint, valid_at timestamptz );  -- valid_at is read by the query below; its exact type here is an assumption
COPY persona_followers_import FROM stdin;
...
\.

CREATE INDEX index_persona_followers_import_on_service_id ON persona_followers_import(service_id, follower_id);

service_ids := SELECT DISTINCT service_id FROM persona_followers_import;
for each service_id in service_ids:
    EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS )
    INSERT INTO persona_followers(service_id, follower_id, valid_at)
    SELECT service_id, follower_id, MIN(valid_at)
    FROM persona_followers_import
    WHERE follower_id IS NOT NULL
      AND follower_id NOT IN (SELECT follower_id FROM persona_followers WHERE service_id = :service_id)
      AND service_id = :service_id
    GROUP BY service_id, follower_id
    ORDER BY follower_id;
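
For the record, here is a minimal runnable sketch of that loop as a server-side DO block (PL/pgSQL, PostgreSQL 9.0+), minus the EXPLAIN wrapper. It is an assumed equivalent of the client-side pseudocode, not the script I actually ran, and unlike separate client statements it runs all batches in one transaction. The ANALYZE line is also an addition, not part of my original script:

-- Temp tables are invisible to autovacuum, so gather stats by hand after
-- the COPY; otherwise the planner works from default row estimates.
ANALYZE persona_followers_import;

DO $$
DECLARE
  sid bigint;
BEGIN
  -- One INSERT per service_id, mirroring the pseudocode loop above.
  FOR sid IN SELECT DISTINCT service_id FROM persona_followers_import LOOP
    INSERT INTO persona_followers (service_id, follower_id, valid_at)
    SELECT service_id, follower_id, MIN(valid_at)
      FROM persona_followers_import
     WHERE follower_id IS NOT NULL
       AND follower_id NOT IN (SELECT follower_id
                                 FROM persona_followers
                                WHERE service_id = sid)
       AND service_id = sid
     GROUP BY service_id, follower_id
     ORDER BY follower_id;  -- insert in index order, as Jeff suggested
  END LOOP;
END;
$$;

Limiting each INSERT to a single service_id keeps the hashed NOT IN subplan small, which is what holds the per-batch runtime under 20 ms.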

This seems to give me the best possible throughput. I was able to import days of data in an hour, compared to hours of work for one day of data.

Thanks for all suggestions, and PostgreSQL rocks!
François Beausoleil
