Re: How to insert a bulk of data with unique-violations very fast

From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date: 2010-06-03 23:03:37
Message-ID: AANLkTim0X5AyDz6K7uPdwCwoqNNSOjAUEsVJ9maS8aO0@mail.gmail.com
Lists: pgsql-performance

2010/6/1 Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>:
> Hello,
>
> i have a set of unique data of about 150,000,000 rows. Regularly I receive a
> new list of data that contains many times more rows than the already stored
> set, often around 2,000,000,000 rows. This list contains many duplicates and
> often also the already stored data.
> I want to store only those entries that are not already stored, and I do not
> want to store duplicates. Example:
>
> Already stored set:
> a,b,c
>
> Given set:
> a,b,a,c,d,a,c,d,b
>
> Expected set after import:
> a,b,c,d
>
> I am now looking for a faster way to do the import. At the moment I COPY the
> new data into a table 'import', then remove the duplicates and insert every
> row that is not already known; after that, 'import' is truncated.
>
> Is there a faster way? Should I just insert every row and ignore it if the
> unique constraint fails?
>
> Here is the simplified table schema; in real life it is partitioned:
> test=# \d urls
>                          Table "public.urls"
>  Column |  Type   |                       Modifiers
> --------+---------+-------------------------------------------------------
>  url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
>  url    | text    | not null
> Indexes:
>    "urls_url" UNIQUE, btree (url)
>    "urls_url_id" btree (url_id)
>
> Thanks for every hint or advice! :)

I think you should have a look at pgloader. It does COPY with error
handling, and it is very effective.

http://pgloader.projects.postgresql.org/
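
For reference, a minimal sketch of the staging-table merge Torsten describes
(COPY into a staging table, then insert only the rows that are not already
known) could look roughly like this. It is an illustration only, under assumed
names: the staging table import_urls and the file path are placeholders, not
taken from this thread.

    BEGIN;

    -- Load the raw input into an unindexed staging table.
    -- (Placeholder name and path; use \copy in psql for a client-side file.)
    CREATE TEMP TABLE import_urls (url text);
    COPY import_urls (url) FROM '/tmp/new_urls.txt';

    -- Insert only the URLs that are not already stored, deduplicating the
    -- input at the same time: the anti-join keeps rows with no match in urls.
    INSERT INTO urls (url)
    SELECT DISTINCT i.url
    FROM import_urls i
    LEFT JOIN urls u ON u.url = i.url
    WHERE u.url IS NULL;

    COMMIT;

On later PostgreSQL versions (9.5 and up), INSERT ... ON CONFLICT DO NOTHING
against the unique index on url is another way to skip already-stored rows
without a separate anti-join.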

>
> Greetings from Germany,
> Torsten
> --
> http://www.dddbl.de - a database layer that abstracts work with 8
> different database systems,
> separates queries from applications, and can automatically evaluate
> query results.
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
