Re: Add new COPY option REJECT_LIMIT

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: zhjwpku(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Add new COPY option REJECT_LIMIT
Date: 2024-09-24 05:25:47
Message-ID: 056a5728cf733685cff7d9b5a8fbb0ad@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Updated the patch.

On 2024-07-23 23:06, torikoshia wrote:

> On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao
> <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
> Thanks for your review.
>
>> On 2024/07/22 21:37, torikoshia wrote:
>>> On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku(at)gmail(dot)com>
>>> wrote:
>>> Thanks for the comment.
>>>
>>>> In patch 0002, the ratio is calculated by the already
>>>> skipped/processed
>>>> rows, but what if a user wants to copy 1000 rows, and he/she can
>>>> tolerate
>>>> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad
>>>> row in the
>>>> first 100 rows will fail the entire command, this might surprise the
>>>> user.
>>>
>>> Since the ratio is calculated after all data is processed, the case
>>> "one bad row in the first 100 rows will fail the entire command"
>>> doesn't happen:
>>
>> Yes, but is this the desired behavior when using the ratio threshold?
>> I was thinking that COPY should fail once the error ratio (errors vs.
>> total rows in the input file) reaches the threshold.
>
> Users might expect like you.
> However, implementing it would need a row number counting feature as
> you pointed out, and it seems an overkill.
>
> Describing the difference between ratio and number in the manual might
> help, but
> it might be better to make REJECT_LIMIT support only the number of
> errors and leave it to the user to calculate the number from the
> ratio.
>
> I'd like to hear if anyone has an opinion on the need for supporting
> ratio.

Since there was no opinion about it, attached a patch only for
REJECT_LIMIT specifying number.

>> + This option must be used with <literal>ON_ERROR</literal> to be
>> set to
>> + other than <literal>stop</literal>.
>>
>> Regarding the ON_ERROR option, now it only has two values.
>> Instead of saying "other than stop," we should simply use "ignore"
>> for clarity and intuition?
>
> I'll Modify it.
> The reason for the roundabout wording was the expectation that
> on_error would support values other than these in the future, but as
> you point out, there are currently only two.
>
>
>> + When specified <literal>INFINITY</literal>,
>> <command>COPY</command> ignores all
>> + the errors. This is a synonym for <literal>ON_ERROR</literal>
>> <literal>ignore</literal>.
>>
>> For the INFINITY value, the description is a bit unclear.
>> As I understand it, INFINITY is the default for REJECT_LIMIT.
>> So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
>> COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is
>> used.
>>
>>
>> In line with my previous suggestion, if we support only REJECT_LIMIT
>> without ON_ERROR, having INFINITY makes sense. However,
>> with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
>> Users can just set ON_ERROR=ignore to ignore all errors,
>> making INFINITY unnecessary for REJECT_LIMIT. This is open for
>> discussion, but I believe it's better to remove INFINITY from
>> the REJECT_LIMIT options.
>
> Agreed.
> Unless there are opposing opinions, I'm going to remove 'INFINITY' in
> the next patch.

As there are no opposing opinions, removed 'INFINITY' as well.

>> + else if (strcmp(defel->defname, "reject_limit") == 0)
>> + {
>> + if (reject_limit_specified)
>> + errorConflictingDefElem(defel,
>> pstate);
>> + if (!opts_out->on_error)
>> + ereport(ERROR,
>> +
>> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> + errmsg("REJECT_LIMIT
>> requires ON_ERROR to be set to other than stop")));
>>
>> Using "ignore" instead of "other than stop" in the error message
>> is clearer and more intuitive.
>
> Agreed.
>
>
>> Checking if ON_ERROR and REJECT_LIMIT are specified should be
>> done after the foreach() processing of the options. Otherwise,
>> if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can
>> occur.
>>
>> ---------------
>> =# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
>> ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop
>> ---------------
>
> Ugh, I'll modify it.
>
>> + ereport(ERROR,
>> +
>> (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
>> + errmsg("exceeded the
>> number specified by REJECT_LIMIT \"%lld\"",
>> + (long
>> long) cstate->opts.reject_limits.num_err)));
>>
>> The error message isn't clear about what exceeded REJECT_LIMIT.
>> How about saying "skipped more than REJECT_LIMIT rows" or something
>> instead?
>
> Agreed.
>
>> +/*
>> + * A struct to hold reject_limit options, in a parsed form.
>> + * More values to be added in another patch.
>> + */
>>
>> The latter comment doesn't seem necessary or helpful.
>
> Agreed.
>
> [1]
> https://www.postgresql.org/message-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_%3DzuWjkz1%2B25Nd8bpsrDkx5Q%40mail.gmail.com

On Tue, Jul 23, 2024 at 11:10 PM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
wrote:
> On 2024-07-23 02:06, Kirill Reshke wrote:

>> If nothing is specified, then the maximum tolerable number of errors
>> is one, right? Should we state this explicitly in the documentation?

> REJECT_LIMIT now can be used wonly when on_error=ignore, I think the
> default(when only on_error=ignore is specified) is unlimited.
> Anyway, I'm going to add a description about the default.

Added the following description:

+ Just setting <literal>ON_ERROR</literal> to
<literal>ignore</literal>
+ tolerates unlimited number of errors.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

Attachment Content-Type Size
v4-0001-Add-new-COPY-option-REJECT_LIMIT-number.patch text/x-diff 8.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2024-09-24 06:05:42 Re: ANALYZE ONLY
Previous Message Amit Kapila 2024-09-24 05:19:10 Re: Conflict detection for update_deleted in logical replication