| From: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> | 
|---|---|
| To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> | 
| Cc: | zhjwpku(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: Add new COPY option REJECT_LIMIT | 
| Date: | 2024-09-24 05:25:47 | 
| Message-ID: | 056a5728cf733685cff7d9b5a8fbb0ad@oss.nttdata.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Updated the patch.
On 2024-07-23 23:06, torikoshia wrote:
> On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao 
> <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> 
> Thanks for your review.
> 
>> On 2024/07/22 21:37, torikoshia wrote:
>>> On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku(at)gmail(dot)com> 
>>> wrote:
>>> Thanks for the comment.
>>> 
>>>> In patch 0002, the ratio is calculated by the already 
>>>> skipped/processed
>>>> rows, but what if a user wants to copy 1000 rows, and he/she can 
>>>> tolerate
>>>> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad 
>>>> row in the
>>>> first 100 rows will fail the entire command, this might surprise the 
>>>> user.
>>> 
>>> Since the ratio is calculated after all data is processed, the case 
>>> "one bad row in the first 100 rows will fail the entire command" 
>>> doesn't happen:
>> 
>> Yes, but is this the desired behavior when using the ratio threshold?
>> I was thinking that COPY should fail once the error ratio (errors vs.
>> total rows in the input file) reaches the threshold.
> 
> Users might expect like you.
> However, implementing it would need a row number counting feature as
> you pointed out, and it seems an overkill.
> 
> Describing the difference between ratio and number in the manual might 
> help, but
> it might be better to make REJECT_LIMIT support only the number of
> errors and leave it to the user to calculate the number from the
> ratio.
> 
> I'd like to hear if anyone has an opinion on the need for supporting 
> ratio.
Since there was no opinion about it, attached a patch only for 
REJECT_LIMIT specifying number.
>> +      This option must be used with <literal>ON_ERROR</literal> to be 
>> set to
>> +      other than <literal>stop</literal>.
>> 
>> Regarding the ON_ERROR option, now it only has two values.
>> Instead of saying "other than stop," we should simply use "ignore"
>> for clarity and intuition?
> 
> I'll Modify it.
> The reason for the roundabout wording was the expectation that
> on_error would support values other than these in the future, but as
> you point out, there are currently only two.
> 
> 
>> +      When specified <literal>INFINITY</literal>, 
>> <command>COPY</command> ignores all
>> +      the errors. This is a synonym for <literal>ON_ERROR</literal> 
>> <literal>ignore</literal>.
>> 
>> For the INFINITY value, the description is a bit unclear.
>> As I understand it, INFINITY is the default for REJECT_LIMIT.
>> So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
>> COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is 
>> used.
>> 
>> 
>> In line with my previous suggestion, if we support only REJECT_LIMIT
>> without ON_ERROR, having INFINITY makes sense. However,
>> with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
>> Users can just set ON_ERROR=ignore to ignore all errors,
>> making INFINITY unnecessary for REJECT_LIMIT. This is open for
>> discussion, but I believe it's better to remove INFINITY from
>> the REJECT_LIMIT options.
> 
> Agreed.
> Unless there are opposing opinions, I'm going to remove 'INFINITY' in
> the next patch.
As there are no opposing opinions, removed 'INFINITY' as well.
>> +               else if (strcmp(defel->defname, "reject_limit") == 0)
>> +               {
>> +                       if (reject_limit_specified)
>> +                               errorConflictingDefElem(defel, 
>> pstate);
>> +                       if (!opts_out->on_error)
>> +                               ereport(ERROR,
>> +                                               
>> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> +                                                errmsg("REJECT_LIMIT 
>> requires ON_ERROR to be set to other than stop")));
>> 
>> Using "ignore" instead of "other than stop" in the error message
>> is clearer and more intuitive.
> 
> Agreed.
> 
> 
>> Checking if ON_ERROR and REJECT_LIMIT are specified should be
>> done after the foreach() processing of the options. Otherwise,
>> if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can 
>> occur.
>> 
>> ---------------
>> =# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
>> ERROR:  REJECT_LIMIT requires ON_ERROR to be set to other than stop
>> ---------------
> 
> Ugh, I'll modify it.
> 
>> +                               ereport(ERROR,
>> +                                               
>> (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
>> +                                                errmsg("exceeded the 
>> number specified by REJECT_LIMIT \"%lld\"",
>> +                                                               (long 
>> long) cstate->opts.reject_limits.num_err)));
>> 
>> The error message isn't clear about what exceeded REJECT_LIMIT.
>> How about saying "skipped more than REJECT_LIMIT rows" or something 
>> instead?
> 
> Agreed.
> 
>> +/*
>> + * A struct to hold reject_limit options, in a parsed form.
>> + * More values to be added in another patch.
>> + */
>> 
>> The latter comment doesn't seem necessary or helpful.
> 
> Agreed.
> 
> [1] 
> https://www.postgresql.org/message-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_%3DzuWjkz1%2B25Nd8bpsrDkx5Q%40mail.gmail.com
On Tue, Jul 23, 2024 at 11:10 PM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> 
wrote:
> On 2024-07-23 02:06, Kirill Reshke wrote:
>> If nothing is specified, then the maximum tolerable number of errors
>> is one, right? Should we state  this explicitly in the documentation?
> REJECT_LIMIT now can be used wonly when on_error=ignore, I think the
> default(when only on_error=ignore is specified) is unlimited.
> Anyway, I'm going to add a description about the default.
Added the following description:
+      Just setting <literal>ON_ERROR</literal> to 
<literal>ignore</literal>
+      tolerates unlimited number of errors.
-- 
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
| Attachment | Content-Type | Size | 
|---|---|---|
| v4-0001-Add-new-COPY-option-REJECT_LIMIT-number.patch | text/x-diff | 8.6 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | David Rowley | 2024-09-24 06:05:42 | Re: ANALYZE ONLY | 
| Previous Message | Amit Kapila | 2024-09-24 05:19:10 | Re: Conflict detection for update_deleted in logical replication |