Re: How to insert a bulk of data with unique-violations very fast

From: Andy Colson <andy(at)squeakycode(dot)net>
To: Torsten Zühlsdorff <foo(at)meisterderspiele(dot)de>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to insert a bulk of data with unique-violations very fast
Date: 2010-06-06 12:45:34
Message-ID: 4C0B986E.2060403@squeakycode.net
Lists: pgsql-performance

On 06/01/2010 10:03 AM, Torsten Zühlsdorff wrote:
> Hello,
>
> I have a set of unique data with about 150,000,000 rows. Regularly I
> get a list of data that contains many times more rows than the
> already stored set, often around 2,000,000,000 rows. Within these
> rows are many duplicates, and often the already-stored set as well.
> I want to store only every entry that is not already stored, and I
> do not want to store duplicates. Example:
>
> Already stored set:
> a,b,c
>
> Given set:
> a,b,a,c,d,a,c,d,b
>
> Expected set after import:
> a,b,c,d
>
> I am now looking for a faster way to do the import. At the moment I
> import the new data with COPY into a table 'import', then I remove
> the duplicates and insert every row that is not already known; after
> that, 'import' is truncated. (A sketch of this workflow follows the
> quoted message.)
>
> Is there a faster way? Should I just insert every row and ignore it
> if the unique constraint fails?
>
> Here is the simplified table schema; in real life it is partitioned:
> test=# \d urls
>                    Table "public.urls"
>  Column |  Type   |                       Modifiers
> --------+---------+-------------------------------------------------------
>  url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
>  url    | text    | not null
> Indexes:
>     "urls_url" UNIQUE, btree (url)
>     "urls_url_id" btree (url_id)
>
> Thanks for every hint or advice! :)
>
> Greetings from Germany,
> Torsten
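
(For reference, the staging workflow quoted above might look roughly
like this. The table and column names come from the quoted schema, but
the exact statements and the file path are a sketch, not what Torsten
actually runs:)

COPY import (url) FROM '/path/to/batch.txt';  -- hypothetical path: bulk-load the raw rows

INSERT INTO urls (url)
SELECT DISTINCT i.url                         -- drop duplicates within the batch
FROM import i
WHERE NOT EXISTS (                            -- skip rows already stored
    SELECT 1 FROM urls u WHERE u.url = i.url
);

TRUNCATE import;                              -- reset the staging table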

I do this with a stored procedure. I do not care about speed because my db is really small and I only insert a few records a month, so I don't know how fast this is, but here is my function:

CREATE FUNCTION addentry(idate timestamp without time zone, ilevel integer) RETURNS character varying
AS $$
declare
    tmp integer;
begin
    -- Try the insert; the unique index on adate raises
    -- unique_violation if the date is already stored.
    insert into blood(adate, alevel) values (idate, ilevel);
    return 'ok';
exception
    when unique_violation then
        -- The date already exists: check that the stored level matches.
        select into tmp alevel from blood where adate = idate;
        if tmp <> ilevel then
            return idate || ' levels differ!';
        else
            return 'ok, already in table';
        end if;
end; $$
LANGUAGE plpgsql;

Use it like: select * from addentry('2010-06-06 8:00:00', 130);

I do an extra check that if the dates match, the levels match too, but you wouldn't have to. There is a unique index on adate.
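
For a bulk load like yours, note that calling such a function once per
row adds overhead, because every plpgsql exception block opens a
subtransaction. A hypothetical adaptation of the same pattern to your
quoted urls schema (add_url is a made-up name, just to sketch the idea)
might look like:

CREATE FUNCTION add_url(iurl text) RETURNS void AS $$
begin
    -- Try the insert; the unique index urls_url raises
    -- unique_violation if the URL is already stored.
    insert into urls(url) values (iurl);
exception
    when unique_violation then
        null;  -- already stored (or a duplicate in this batch): ignore it
end; $$
LANGUAGE plpgsql;

-- Apply it to every staged row:
SELECT add_url(url) FROM import;

Given the row counts you describe, the set-based insert sketched after
your quoted message (DISTINCT plus NOT EXISTS) is likely to be faster
than per-row exception handling.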

-Andy
