From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Emmanuel Cecchet <manu(at)asterdata(dot)com>, Emmanuel Cecchet <Emmanuel(dot)Cecchet(at)asterdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COPY enhancements
Date: 2009-10-08 18:43:51
Message-ID: 603c8f070910081143h2232509agd137f4023a4c6315@mail.gmail.com
Lists: pgsql-hackers
On Thu, Oct 8, 2009 at 1:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Oct 8, 2009 at 12:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Another approach that was discussed earlier was to divvy the rows into
>>> batches. Say every thousand rows you sub-commit and start a new
>>> subtransaction. Up to that point you save aside the good rows somewhere
>>> (maybe a tuplestore). If you get a failure partway through a batch,
>>> you start a new subtransaction and re-insert the batch's rows up to the
>>> bad row. This could be pretty awful in the worst case, but most of the
>>> time it'd probably perform well. You could imagine dynamically adapting
>>> the batch size depending on how often errors occur ...
>
>> Yeah, I think that's promising. There is of course the possibility
>> that a row which previously succeeded could fail the next time around,
>> but most of the time that shouldn't happen, and it should be possible
>> to code it so that it still behaves somewhat sanely if it does.
>
> Actually, my thought was that failure to reinsert a previously good
> tuple should cause us to abort the COPY altogether. This is a
> cheap-and-easy way of avoiding sorcerer's apprentice syndrome.
> Suppose the failures are coming from something like out of disk space,
> transaction timeout, whatever ... a COPY that keeps on grinding no
> matter what is *not* ideal.
I think you handle that by putting a cap on the total number of errors
you're willing to accept (and in any event you'll always skip the row
that failed, so forward progress can't cease altogether). For out of
disk space or transaction timeout, sure, but you might also hit
something like a serialization error on the reinsert that didn't occur
on the original. You don't want that to kill the whole bulk load, I
would think.
...Robert
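
For concreteness, here is a minimal sketch of the batch-and-replay
scheme discussed above, with the error cap Robert suggests bolted on.
It is purely illustrative: insert_row, begin_subxact, commit_subxact
and rollback_subxact are hypothetical stand-ins supplied by the caller,
not real backend APIs, and the actual proposal would stash a batch's
good rows in a tuplestore rather than a Python list.

class InsertError(Exception):
    """A row failed to insert (bad data, constraint violation, ...)."""

class TooManyErrors(Exception):
    """The error cap was exceeded; give up on the whole COPY."""

def copy_in_batches(rows, insert_row, begin_subxact, commit_subxact,
                    rollback_subxact, batch_size=1000, max_errors=100):
    errors = 0

    def flush(pending):
        # Insert one batch in its own subtransaction, retrying without
        # rows that fail.
        nonlocal errors
        while pending:
            begin_subxact()
            bad = None
            for i, row in enumerate(pending):
                try:
                    insert_row(row)
                except InsertError:
                    bad = i
                    break
            if bad is None:
                commit_subxact()      # sub-commit the whole batch
                return
            # The failure poisons the subtransaction: roll it back, drop
            # the offending row, and replay the remainder.  A row that
            # worked before may fail here (e.g. a serialization error);
            # it just counts against the cap instead of aborting the load.
            rollback_subxact()
            errors += 1
            if errors > max_errors:
                raise TooManyErrors()
            del pending[bad]

    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)                  # final partial batch

Tom's stricter policy would abort as soon as a previously good row
fails on replay; in this sketch such a failure just counts against
max_errors, and adapting batch_size to the observed error rate could be
layered on top.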