From: Andy Colson <andy(at)squeakycode(dot)net>
To: Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date: 2010-06-06 12:45:34
Message-ID: 4C0B986E.2060403@squeakycode.net
Lists: pgsql-performance
On 06/01/2010 10:03 AM, Torsten Zühlsdorff wrote:
> Hello,
>
> I have a set of unique data of about 150,000,000 rows. Regularly I
> get a list of data which contains many times more rows than the
> already stored set, often around 2,000,000,000 rows. Within these rows
> are many duplicates, and often the already stored data as well.
> I want to store only the entries which are not already stored,
> and I do not want to store duplicates. Example:
>
> Already stored set:
> a,b,c
>
> Given set:
> a,b,a,c,d,a,c,d,b
>
> Expected set after import:
> a,b,c,d
>
> I am now looking for a faster way to do the import. At the moment I
> import the new data with COPY into a table 'import'. Then I remove the
> duplicates and insert every row which is not already known. After that,
> 'import' is truncated.
>
> Is there a faster way? Should I just insert every row and ignore it if
> the unique constraint fails?
>
> Here is the simplified table schema. In real life it is partitioned:
> test=# \d urls
> Table "public.urls"
> Column | Type | Modifiers
> --------+---------+-------------------------------------------------------
> url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
> url | text | not null
> Indexes:
> "urls_url" UNIQUE, btree (url)
> "urls_url_id" btree (url_id)
>
> Thanks for every hint or advice! :)
>
> Greetings from Germany,
> Torsten
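The staging-table approach described in the quoted message can be written as three set-based statements. A minimal sketch, assuming the schema shown above, a staging table 'import' with a single url column, and a hypothetical input file path:

```sql
-- Load the raw batch into the staging table, duplicates and all.
COPY import (url) FROM '/path/to/batch.txt';

-- Insert each distinct URL that is not already stored.
INSERT INTO urls (url)
SELECT DISTINCT i.url
FROM import i
WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = i.url);

-- Clear the staging table for the next batch.
TRUNCATE import;
```

The DISTINCT handles duplicates within the batch, and the NOT EXISTS subquery (driven by the unique index on url) handles rows already present in the table.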
I do this with a stored procedure. I do not care about speed because my db is really small and I only insert a few records a month, so I don't know how fast this is, but here is my function:
CREATE FUNCTION addentry(idate timestamp without time zone, ilevel integer) RETURNS character varying
AS $$
declare
  tmp integer;
begin
  insert into blood(adate, alevel) values(idate, ilevel);
  return 'ok';
exception
  when unique_violation then
    select into tmp alevel from blood where adate = idate;
    if tmp <> ilevel then
      return idate || ' levels differ!';
    else
      return 'ok, already in table';
    end if;
end; $$
LANGUAGE plpgsql;
Use it like: select * from addentry('2010-06-06 8:00:00', 130);
I do an extra check that if the dates match, the levels match too, but you wouldn't have to. There is a unique index on adate.
-Andy