Quick Links

Re: Bulkloading using COPY - ignore duplicates?

From:	Lee Kindness <lkindness(at)csl(dot)co(dot)uk>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Lee Kindness <lkindness(at)csl(dot)co(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Bulkloading using COPY - ignore duplicates?
Date:	2001-10-01 13:54:25
Message-ID:	15288.30097.74590.206271@elsick.csl.co.uk
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Tom Lane writes:
> Lee Kindness <lkindness(at)csl(dot)co(dot)uk> writes:
> > Would this seem a reasonable thing to do? Does anyone rely on COPY
> > FROM causing an ERROR on duplicate input?
> Yes. This change will not be acceptable unless it's made an optional
> (and not default, IMHO, though perhaps that's negotiable) feature of
> COPY.

I see where you're coming from, but seriously what's the use/point of
COPY aborting and doing a rollback if one duplicate key is found? I
think it's quite reasonable to presume the input to COPY has had as
little processing done on it as possible. I could loop through the
input file before sending it to COPY but that's just wasting cycles
and effort - Postgres has btree lookup built in, I don't want to roll
my own before giving Postgres my input file!

> The implementation might be rather messy too. I don't much care
> for the notion of a routine as low-level as bt_check_unique knowing
> that the context is or is not COPY. We might have to do some
> restructuring.

Well in reality it wouldn't be "you're getting run from copy" but
rather "notice on duplicate, rather than error & exit". There is a
telling comment in nbtinsert.c just before _bt_check_unique() is
called:

/*
* If we're not allowing duplicates, make sure the key isn't already
* in the index. XXX this belongs somewhere else, likely
*/

So perhaps dupes should be searched for before _bt_doinsert is called,
or somewhere more appropriate?

> > Would:
> > WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
> > need to be added to the COPY command (I hope not)?
> It occurs to me that skip-the-insert might be a useful option for
> INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
> the regular discussions we see on whether to do INSERT first or
> UPDATE first when the key might already exist.) Maybe a SET variable
> that applies to all forms of insertion would be appropriate.

That makes quite a bit of sense.

--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.

In response to

Re: Bulkloading using COPY - ignore duplicates? at 2001-10-01 13:36:36 from Tom Lane

Responses

Re: Bulkloading using COPY - ignore duplicates? at 2001-10-01 14:02:54 from Tom Lane
Re: Bulkloading using COPY - ignore duplicates? at 2001-10-01 14:21:22 from Thomas Swan

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2001-10-01 14:02:54	Re: Bulkloading using COPY - ignore duplicates?
Previous Message	Justin Clift	2001-10-01 13:48:19	Re: Spinlock performance improvement proposal