Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Date: 2024-11-11 20:27:53
Message-ID: CALdSSPhKs69u20a5CfYaYRegmfoz13+cjBngkC=RkTMvqtbH2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 11 Nov 2024 at 16:11, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
>
> On 2024-11-09 21:55, Kirill Reshke wrote:
>
> Thanks for working on this!

Thanks for reviewing the v7 patch series!

> > On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
> > wrote:
> >>
> >>
> >>
> >> On 2024/10/26 6:03, Kirill Reshke wrote:
> >> > when the REJECT LIMIT is set to some non-zero number and the number of
> >> > row NULL replacements exceeds the limit, is it OK to fail. Because
> >> > there WAS errors, and we should not tolerate more than $limit errors .
> >> > I do find this behavior to be consistent.
> >>
> >> +1
> >>
> >>
> >> > But what if we don't set a REJECT LIMIT, it is sane to do all
> >> > replacements, as if REJECT LIMIT is inf.
> >>
> >> +1
> >
> > After thinking for a while, I'm now more opposed to this approach. I
> > think we should count rows with erroneous data as errors only if
> > null substitution for these rows failed, not the total number of rows
> > which were modified.
> > Then, to respect the REJECT LIMIT option, we compare this number with
> > the limit. This is actually simpler approach IMHO. What do You think?
>
> IMHO I prefer the previous interpretation.
> I'm not sure this is what people expect, but I assume that REJECT_LIMIT
> is used to specify how many malformed rows are acceptable in the
> "original" data source.

I do like the first version of interpretation, but I have a struggle
with it. According to this interpretation, we will fail COPY command
if the number
of malformed data rows exceeds the limit, not the number of rejected
rows (some percentage of malformed rows are accepted with null
substitution)
So, a proper name for the limit will be MALFORMED_LIMIT, or something.
However, we are unable to change the name since the REJECT_LIMIT
option has already been committed.
I guess I'll just have to put up with this contradiction. I will send
an updated patch shortly...

> >> > But while I was trying to implement that, I realized that I don't
> >> > understand v4 of this patch. My misunderstanding is about
> >> > `t_on_error_null` tests. We are allowed to insert a NULL value for the
> >> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
> >> > do we do that? My thought is we should try to execute
> >> > InputFunctionCallSafe with NULL value (i mean, here [1]) for the
> >> > column after we failed to insert the input value. And, if this second
> >> > call is successful, we do replacement, otherwise we count the row as
> >> > erroneous.
> >>
> >> Your concern is valid. Allowing NULL to be stored in a column with a
> >> NOT NULL
> >> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you
> >> suggested,
> >> NULL values set by SET_TO_NULL should probably be re-evaluated.
> >
> > Thank you. I updated the patch with a NULL re-evaluation.
> >
> >
> > PFA v7. I did not yet update the doc for this patch version, waiting
> > for feedback about REJECT LIMIT + SET_TO_NULL behaviour.
> >
>
>
> There were warnings when I applied the patch:
>
> $ git apply
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170:
> trailing whitespace.
> /*
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189:
> trailing whitespace.
> /* If datatype if okay with NULL,
> replace
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196:
> trailing whitespace.
>
> v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212:
> trailing whitespace.
> /*
>
> > @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState
> > *pstate, bool is_from)
> > parser_errposition(pstate, def->location)));
> ...
> >
> > - if (opts_out->reject_limit && !opts_out->on_error)
> > + if (opts_out->reject_limit && !(opts_out->on_error ==
> > COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))
>
> Hmm, is this change necessary?
> Personally, I feel the previous code is easier to read.
>
>
> "REJECT LIMIT" should be "REJECT_LIMIT"?
> > 1037 errhint("Consider specifying the
> > REJECT LIMIT option to skip erroneous rows.")));
>
>
> SET_TO_NULL is one of the value for ON_ERROR, but the patch treats
> SET_TO_NULL as option for COPY:
>
> 221 --- a/src/bin/psql/tab-complete.in.c
> 222 +++ b/src/bin/psql/tab-complete.in.c
> 223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
> 224 COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
> 225 "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
> 226 "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING",
> "DEFAULT",
> 227 - "ON_ERROR", "LOG_VERBOSITY");
> 228 + "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");
>
>
> > Best regards,
> > Kirill Reshke
>
> --
> Regards,
>
> --
> Atsushi Torikoshi
> Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

--
Best regards,
Kirill Reshke

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2024-11-11 20:56:11 Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Previous Message Masahiko Sawada 2024-11-11 20:20:40 Re: UUID v7