Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, reshkekirill(at)gmail(dot)com
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Date: 2024-11-12 08:38:25
Message-ID: 61ec729176e403e5bc32b98bf34f004c@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2024-11-12 14:17, Yugo Nagata wrote:
> On Tue, 12 Nov 2024 14:03:50 +0900
> Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
>
>> On Tue, 12 Nov 2024 01:27:53 +0500
>> Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
>>
>> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
>> > >
>> > > On 2024-11-09 21:55, Kirill Reshke wrote:
>> > >
>> > > Thanks for working on this!
>> >
>> > Thanks for reviewing the v7 patch series!
>> >
>> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
>> > > > wrote:
>> > > >>
>> > > >>
>> > > >>
>> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
>> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
>> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
>> > > >> > I do find this behavior to be consistent.
>> > > >>
>> > > >> +1
>> > > >>
>> > > >>
>> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> > > >> > replacements, as if REJECT LIMIT is inf.
>> > > >>
>> > > >> +1
>> > > >
>> > > > After thinking for a while, I'm now more opposed to this approach. I
>> > > > think we should count rows with erroneous data as errors only if
>> > > > null substitution for these rows failed, not the total number of rows
>> > > > which were modified.
>> > > > Then, to respect the REJECT LIMIT option, we compare this number with
>> > > > the limit. This is actually simpler approach IMHO. What do You think?
>> > >
>> > > IMHO I prefer the previous interpretation.
>> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
>> > > is used to specify how many malformed rows are acceptable in the
>> > > "original" data source.
>>
>> I also prefer the previous version.
>>
>> > I do like the first version of interpretation, but I have a struggle
>> > with it. According to this interpretation, we will fail COPY command
>> > if the number
>> > of malformed data rows exceeds the limit, not the number of rejected
>> > rows (some percentage of malformed rows are accepted with null
>> > substitution)

I feel your concern is valid.
Currently 'reject' can occur only when converting a column's input value
to its data type, but if we introduce set_to_null option 'reject' also
occurs when inserting null, i.e. not null constraint.

>> > So, a proper name for the limit will be MALFORMED_LIMIT, or something.
>> > However, we are unable to change the name since the REJECT_LIMIT
>> > option has already been committed.
>> > I guess I'll just have to put up with this contradiction. I will send
>> > an updated patch shortly...
>> I think we can rename the REJECT_LIMIT option because it is not yet
>> released.

+1

>> The documentation says that REJECT_LIMIT "Specifies the maximum number
>> of errors",
>> and there are no wording "reject" in the description, so I wonder it
>> is unclear
>> what means in "REJECT" in REJECT_LIMIT. It may be proper to use
>> ERROR_LIMIT
>> since it is supposed to be used with ON_ERROR.
>>
>> Alternatively, if we emphasize that errors are handled other than
>> terminating
>> the command,perhaps MALFORMED_LIMIT as proposed above or
>> TOLERANCE_LIMIT may be
>> good, for example.
>
> I might misunderstand the meaning of the name. If REJECT_LIMIT means "a
> limit on
> the number of rows with any malformed value allowed before the COPY
> command is
> rejected", we would not have to rename it.

The meaning of REJECT_LIMIT is what you described, and I think Kirill
worries about cases when malformed rows are accepted(=not REJECTed) with
null substitution. REJECT_LIMIT counts this case as REJECTed.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-11-12 08:45:29 Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY
Previous Message Nazir Bilal Yavuz 2024-11-12 08:38:11 Re: Adding NetBSD and OpenBSD to Postgres CI