Quick Links

Re: Bulkloading using COPY - ignore duplicates?

From:	"Ross J(dot) Reedstrom" <reedstrm(at)rice(dot)edu>
To:	Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc:	Patrick Welche <prlw1(at)newn(dot)cam(dot)ac(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Bulkloading using COPY - ignore duplicates?
Date:	2001-12-13 16:22:59
Message-ID:	20011213102259.B10349@rice.edu
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Dec 13, 2001 at 03:44:31PM +0000, Lee Kindness wrote:
> Patrick Welche writes:
> > On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
> > > That's what I'm currently doing as a workaround - a SELECT DISTINCT
> > > from a temporary table into the real table with the unique index on
> > > it. However this takes absolute ages - say 5 seconds for the copy
> > > (which is the ballpark figure I aiming toward and can achieve with
> > > Ingres) plus another 30ish seconds for the SELECT DISTINCT.
> > Then your column really isn't unique,
>
> That's another discussion entirely ;) - it's spat out by a real-time
> system which doesn't have the time or resources to check this. Further
> precision loss later in the data's life adds more duplicates...

Hmm, the data has a later life - sounds like you'll need to remove dups
then, anyway, so can you get away with just letting the dups in? Remove
the UNIQUE requirement, and let the real time system just dump away.
How critical is it to later steps that there be no dups? And how many
(potential) dups is your RTS producing, anyway?

Your later processing (which apparently can _generate_ dups) might be
the out of the critical time path place to worry about removing dups.

Ross

P.S. This falls into the class of problem solving characterized by
"if you can't solve the problem as stated, restate the problem to be
one you _can_ solve" ;-)

>
> > so how about dropping the unique index, import the data, fix the
> > duplicates, recreate the unique index - just as another possible
> > work around ;)
>
> This is just going to be the same(ish) time, no?
>
> CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
> COPY tab FROM 'file';
> DELETE FROM tab WHERE p1, p2 NOT IN (SELECT DISTINCT p1, p2
> FROM tab);
> CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);
>
> or am I missing something?
>
> Thanks, Lee.
>
> --
> Lee Kindness, Senior Software Engineer, Concept Systems Limited.
> http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html

In response to

Re: Bulkloading using COPY - ignore duplicates? at 2001-12-13 15:44:31 from Lee Kindness

Responses

Re: Bulkloading using COPY - ignore duplicates? at 2001-12-13 18:36:27 from Ross J. Reedstrom

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Doug McNaught	2001-12-13 17:17:44	Re: Platform testing (last call?)
Previous Message	Larry Rosenman	2001-12-13 16:22:53	Re: Platform testing (last call?)