Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: reshkekirill(at)gmail(dot)com, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Date: 2024-11-13 13:02:35
Message-ID: 20241113220235.85b223c663c98da88ae91b24@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 12 Nov 2024 17:38:25 +0900
torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:

> On 2024-11-12 14:17, Yugo Nagata wrote:
> > On Tue, 12 Nov 2024 14:03:50 +0900
> > Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
> >
> >> On Tue, 12 Nov 2024 01:27:53 +0500
> >> Kirill Reshke <reshkekirill(at)gmail(dot)com> wrote:
> >>
> >> > On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> >> > >
> >> > > On 2024-11-09 21:55, Kirill Reshke wrote:
> >> > >
> >> > > Thanks for working on this!
> >> >
> >> > Thanks for reviewing the v7 patch series!
> >> >
> >> > > > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
> >> > > > wrote:
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > > >> > when the REJECT LIMIT is set to some non-zero number and the number of
> >> > > >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > > >> > there WAS errors, and we should not tolerate more than $limit errors .
> >> > > >> > I do find this behavior to be consistent.
> >> > > >>
> >> > > >> +1
> >> > > >>
> >> > > >>
> >> > > >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > > >> > replacements, as if REJECT LIMIT is inf.
> >> > > >>
> >> > > >> +1
> >> > > >
> >> > > > After thinking for a while, I'm now more opposed to this approach. I
> >> > > > think we should count rows with erroneous data as errors only if
> >> > > > null substitution for these rows failed, not the total number of rows
> >> > > > which were modified.
> >> > > > Then, to respect the REJECT LIMIT option, we compare this number with
> >> > > > the limit. This is actually simpler approach IMHO. What do You think?
> >> > >
> >> > > IMHO I prefer the previous interpretation.
> >> > > I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> >> > > is used to specify how many malformed rows are acceptable in the
> >> > > "original" data source.
> >>
> >> I also prefer the previous version.
> >>
> >> > I do like the first version of interpretation, but I have a struggle
> >> > with it. According to this interpretation, we will fail COPY command
> >> > if the number
> >> > of malformed data rows exceeds the limit, not the number of rejected
> >> > rows (some percentage of malformed rows are accepted with nulnot-null constraintl
> >> > substitution)
>
> I feel your concern is valid.
> Currently 'reject' can occur only when converting a column's input value
> to its data type, but if we introduce set_to_null option 'reject' also
> occurs when inserting null, i.e. not null constraint.

I can suppose "reject" means failure of COPY command here, that is, a reject
of executing the command, not an error of data input. If so, we can interpret
REJECT_LIMIT as "the number of malformed rows allowed before the COPY command
is REJECTed" (not the number of rejected rows). In this case, I think we don't
have to rename the option name.

Of course, if there is more proper name that makes it easy for users to
understand the behaviour of the option, renaming should be nice.

> >> The documentation says that REJECT_LIMIT "Specifies the maximum number
> >> of errors",
> >> and there are no wording "reject" in the description, so I wonder it
> >> is unclear
> >> what means in "REJECT" in REJECT_LIMIT. It may be proper to use
> >> ERROR_LIMIT
> >> since it is supposed to be used with ON_ERROR.
> >>
> >> Alternatively, if we emphasize that errors are handled other than
> >> terminating
> >> the command,perhaps MALFORMED_LIMIT as proposed above or
> >> TOLERANCE_LIMIT may be
> >> good, for example.
> >
> > I might misunderstand the meaning of the name. If REJECT_LIMIT means "a
> > limit on
> > the number of rows with any malformed value allowed before the COPY
> > command is
> > rejected", we would not have to rename it.
>
> The meaning of REJECT_LIMIT is what you described, and I think Kirill
> worries about cases when malformed rows are accepted(=not REJECTed) with
> null substitution. REJECT_LIMIT counts this case as REJECTed.

I am a bit confused. You mean "REJECT" is raising a soft error of data
input here instead of terminating COPY?

Regards,
Yugo Nagata

--
Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-11-13 13:13:14 Re: BitmapOr node not used in plan for ANY/IN but is for sequence of ORs ...
Previous Message Andrei Lepikhov 2024-11-13 12:59:04 Re: Some dead code in get_param_path_clause_serials()