Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Kirill Reshke <reshkekirill(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Date: 2024-11-11 11:11:02
Message-ID: 501dd655ddb04693c15baeb6485bc601@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2024-11-09 21:55, Kirill Reshke wrote:

Thanks for working on this!

> On Thu, 7 Nov 2024 at 23:00, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
> wrote:
>>
>>
>>
>> On 2024/10/26 6:03, Kirill Reshke wrote:
>> > when the REJECT LIMIT is set to some non-zero number and the number of
>> > row NULL replacements exceeds the limit, is it OK to fail. Because
>> > there WAS errors, and we should not tolerate more than $limit errors .
>> > I do find this behavior to be consistent.
>>
>> +1
>>
>>
>> > But what if we don't set a REJECT LIMIT, it is sane to do all
>> > replacements, as if REJECT LIMIT is inf.
>>
>> +1
>
> After thinking for a while, I'm now more opposed to this approach. I
> think we should count rows with erroneous data as errors only if
> null substitution for these rows failed, not the total number of rows
> which were modified.
> Then, to respect the REJECT LIMIT option, we compare this number with
> the limit. This is actually simpler approach IMHO. What do You think?

IMHO I prefer the previous interpretation.
I'm not sure this is what people expect, but I assume that REJECT_LIMIT
is used to specify how many malformed rows are acceptable in the
"original" data source.

>> > But while I was trying to implement that, I realized that I don't
>> > understand v4 of this patch. My misunderstanding is about
>> > `t_on_error_null` tests. We are allowed to insert a NULL value for the
>> > first column of t_on_error_null using COPY ON_ERROR SET_TO_NULL. Why
>> > do we do that? My thought is we should try to execute
>> > InputFunctionCallSafe with NULL value (i mean, here [1]) for the
>> > column after we failed to insert the input value. And, if this second
>> > call is successful, we do replacement, otherwise we count the row as
>> > erroneous.
>>
>> Your concern is valid. Allowing NULL to be stored in a column with a
>> NOT NULL
>> constraint via COPY ON_ERROR=SET_TO_NULL does seem unexpected. As you
>> suggested,
>> NULL values set by SET_TO_NULL should probably be re-evaluated.
>
> Thank you. I updated the patch with a NULL re-evaluation.
>
>
> PFA v7. I did not yet update the doc for this patch version, waiting
> for feedback about REJECT LIMIT + SET_TO_NULL behaviour.
>

There were warnings when I applied the patch:

$ git apply
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:170:
trailing whitespace.
/*
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:181:
trailing whitespace.

v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:189:
trailing whitespace.
/* If datatype if okay with NULL,
replace
v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:196:
trailing whitespace.

v7-0001-Incrtoduce-COPY-option-to-replace-columns-contain.patch:212:
trailing whitespace.
/*

> @@ -403,12 +403,14 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState
> *pstate, bool is_from)
> parser_errposition(pstate, def->location)));
...
>
> - if (opts_out->reject_limit && !opts_out->on_error)
> + if (opts_out->reject_limit && !(opts_out->on_error ==
> COPY_ON_ERROR_NULL || opts_out->on_error == COPY_ON_ERROR_IGNORE))

Hmm, is this change necessary?
Personally, I feel the previous code is easier to read.

"REJECT LIMIT" should be "REJECT_LIMIT"?
> 1037 errhint("Consider specifying the
> REJECT LIMIT option to skip erroneous rows.")));

SET_TO_NULL is one of the value for ON_ERROR, but the patch treats
SET_TO_NULL as option for COPY:

221 --- a/src/bin/psql/tab-complete.in.c
222 +++ b/src/bin/psql/tab-complete.in.c
223 @@ -3235,7 +3235,7 @@ match_previous_words(int pattern_id,
224 COMPLETE_WITH("FORMAT", "FREEZE", "DELIMITER", "NULL",
225 "HEADER", "QUOTE", "ESCAPE", "FORCE_QUOTE",
226 "FORCE_NOT_NULL", "FORCE_NULL", "ENCODING",
"DEFAULT",
227 - "ON_ERROR", "LOG_VERBOSITY");
228 + "ON_ERROR", "SET_TO_NULL", "LOG_VERBOSITY");

> Best regards,
> Kirill Reshke

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-11-11 11:11:51 Re: doc: pgevent.dll location
Previous Message Joel Jacobson 2024-11-11 10:45:27 Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes