From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date: 2010-06-03 23:03:37
Message-ID: AANLkTim0X5AyDz6K7uPdwCwoqNNSOjAUEsVJ9maS8aO0@mail.gmail.com
Lists: pgsql-performance
2010/6/1 Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>:
> Hello,
>
> I have a set of unique data of about 150,000,000 rows. Regularly I get a
> list of data that contains many times more rows than the already stored
> set, often around 2,000,000,000 rows. Within these rows are many duplicates,
> and often the already stored data as well.
> I want to store only the entries that are not already stored, and I do not
> want to store duplicates. Example:
>
> Already stored set:
> a,b,c
>
> Given set:
> a,b,a,c,d,a,c,d,b
>
> Expected set after import:
> a,b,c,d
>
> I'm now looking for a faster way to import. At the moment I import the
> new data with COPY into a table 'import', then I remove the duplicates and
> insert every row that is not already known. After that, 'import' is truncated.
>
> Is there a faster way? Should I just insert every row and ignore it if the
> unique constraint fails?
>
> Here is the simplified table schema; in real life it's partitioned:
> test=# \d urls
>                              Table "public.urls"
> Column |  Type   |                       Modifiers
> --------+---------+-------------------------------------------------------
> url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
> url    | text    | not null
> Indexes:
>     "urls_url" UNIQUE, btree (url)
>     "urls_url_id" btree (url_id)
>
> Thanks for every hint or advice! :)
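The staging workflow described above can be collapsed into a single set-based statement. A minimal sketch, assuming the staging table 'import' has a single url column matching the schema in the quote (column and table names taken from the original mail):

```sql
-- Deduplicate the staging table and skip rows already present in urls,
-- in one set-based pass instead of row-by-row inserts.
INSERT INTO urls (url)
SELECT DISTINCT i.url
FROM import i
WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = i.url);

-- Clear the staging table for the next batch.
TRUNCATE import;
```

Inserting every row individually and trapping unique_violation (e.g. in a PL/pgSQL exception block) forces a subtransaction per row and is typically far slower than one statement like the above.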
I think you need to have a look at pgloader. It does COPY with error
handling, and it is very effective.
http://pgloader.projects.postgresql.org/
>
> Greetings from Germany,
> Torsten
> --
> http://www.dddbl.de - a database layer that abstracts work with 8
> different database systems, separates queries from applications, and
> can automatically evaluate query results.
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support