Re: Bulkloading using COPY - ignore duplicates?

From:	"Ross J(dot) Reedstrom" <reedstrm(at)rice(dot)edu>
To:	Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Bulkloading using COPY - ignore duplicates?
Date:	2001-12-13 18:36:27
Message-ID:	20011213123627.B11073@rice.edu
Lists:	pgsql-hackers

O.K., time to start looking into the _nature_ of the dups in your
data, to see if there's anything specific to take advantage of, since
the general solution (tell the DBMS to ignore dups) isn't available,
and isn't likely to get there real soon.

So what does your data look like, and how do the dups occur?

Any chance it's in a really simple format, and the dups are also really
simple, like 'one record per line, dups occur as identical adjacent
lines'? If so, 'uniq' will solve the problem with little to no speed
penalty. (It's the sort that kills ...)
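
For what it's worth, here is a rough Python sketch of what 'uniq' does in
that case: it remembers only the previous line, so no sort is needed and
memory use stays constant. The function names are made up for illustration,
and it assumes one record per line with dups showing up as identical
adjacent lines.

```python
from typing import Iterable, Iterator, TextIO


def drop_adjacent_duplicates(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line unless it is identical to the line just before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def filter_stream(src: TextIO, dst: TextIO) -> None:
    """Copy src to dst, dropping identical adjacent lines (like 'uniq')."""
    dst.writelines(drop_adjacent_duplicates(src))
```

Calling filter_stream(sys.stdin, sys.stdout) would turn this into a pipe
filter sitting between the data file and the loader. Note that, like 'uniq',
it only catches dups that are adjacent; identical lines separated by other
records pass through.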

Or are you only getting a dup'ed field, and the rule is 'ignore later
records'? I could see this happening if the data is timestamped at a
granularity that doesn't _exactly_ match the repetition rate: e.g.
stamp to the second, record once a second.
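
If it is the dup'ed-field case, a filter along these lines might do. This is
only a sketch under assumptions I can't check from here: tab-delimited
records, the key (say, the timestamp) in the first field, and records with
the same key arriving next to each other, as they would in time-ordered
data. The function name and parameters are hypothetical.

```python
from typing import Iterable, Iterator, Sequence


def drop_repeated_keys(
    lines: Iterable[str],
    key_fields: Sequence[int] = (0,),   # assumed: key is the first field
    delimiter: str = "\t",              # assumed: tab-delimited records
) -> Iterator[str]:
    """Keep the first record of each run of equal keys, ignore later ones."""
    previous_key = None
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        key = tuple(fields[i] for i in key_fields)
        if key != previous_key:
            yield line
        previous_key = key
```

Only the previous key is kept in memory, so the cost per record is a split
and a comparison. If same-key records can be separated by other records,
this is not enough and you would need to remember every key seen so far.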

So, what's it look like? Since it's one format, I bet a small, simple
pipe filter could handle dup elimination on the fly.

Ross

On Thu, Dec 13, 2001 at 05:02:15PM +0000, Lee Kindness wrote:
>
> The RTS outputs to a file which is then subsequently used as input to
> other packages, one of which is the application I'm concerned
> with. While fixing at source is the ideal solution there are terabytes
> of legacy data around (this is raw seismic navigational data). Also
> there is more than one competing package...
>
> Our package post-processes (we're still very concerned about speed as
> this is normally done while 'shooting' the seismic data) this data to
> produce the final seismic navigational data, which is then later used
> by other products...
>
> The problem at hand is importing the initial data - no duplicates are
> produced by the program itself later (nor in its output data).
>
> Sadly a large number of later SQL queries assume no duplicates and
> would result in incorrect processing calculations, amongst other
> things. The sheer number of these queries makes changing them
> impractical.
>
> > P.S. This falls into the class of problem solving characterized by
> > "if you can't solve the problem as stated, restate the problem to be
> > one you _can_ solve" ;-)
>
> Which is what I've been knocking my head against for the last few
> weeks ;) The real problem is a move away from our current RDBMS
> (Ingres) to PostgreSQL will not happen if the performance of the
> product significantly decreases (which it currently has for the import
> stage) and since Ingres already just ignores the duplicates...
>
> I really want to move to PostgreSQL...
>
> Thanks for your input,
>
> --
> Lee Kindness, Senior Software Engineer, Concept Systems Limited.
> http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595

In response to

Re: Bulkloading using COPY - ignore duplicates? at 2001-12-13 16:22:59 from Ross J. Reedstrom
