Re: Add new COPY option REJECT_LIMIT

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: zhjwpku(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Add new COPY option REJECT_LIMIT
Date: 2024-07-23 14:06:32
Message-ID: 5f807fcf3a36df7ba41464ab40b5c37d@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao
<masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:

Thanks for your review.

> On 2024/07/22 21:37, torikoshia wrote:
>> On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku(at)gmail(dot)com>
>> wrote:
>> Thanks for the comment.
>>
>>> In patch 0002, the ratio is calculated by the already
>>> skipped/processed
>>> rows, but what if a user wants to copy 1000 rows, and he/she can
>>> tolerate
>>> 10 error rows, so he/she might set *reject_limit 0.01*, but one bad
>>> row in the
>>> first 100 rows will fail the entire command, this might surprise the
>>> user.
>>
>> Since the ratio is calculated after all data is processed, the case
>> "one bad row in the first 100 rows will fail the entire command"
>> doesn't happen:
>
> Yes, but is this the desired behavior when using the ratio threshold?
> I was thinking that COPY should fail once the error ratio (errors vs.
> total rows in the input file) reaches the threshold.

Users might expect like you.
However, implementing it would need a row number counting feature as you
pointed out, and it seems an overkill.

Describing the difference between ratio and number in the manual might
help, but
it might be better to make REJECT_LIMIT support only the number of
errors and leave it to the user to calculate the number from the ratio.

I'd like to hear if anyone has an opinion on the need for supporting
ratio.
I remember David prefers ratio[1].

> + This option must be used with <literal>ON_ERROR</literal> to be
> set to
> + other than <literal>stop</literal>.
>
> Regarding the ON_ERROR option, now it only has two values.
> Instead of saying "other than stop," we should simply use "ignore"
> for clarity and intuition?

I'll Modify it.
The reason for the roundabout wording was the expectation that on_error
would support values other than these in the future, but as you point
out, there are currently only two.

> + When specified <literal>INFINITY</literal>,
> <command>COPY</command> ignores all
> + the errors. This is a synonym for <literal>ON_ERROR</literal>
> <literal>ignore</literal>.
>
> For the INFINITY value, the description is a bit unclear.
> As I understand it, INFINITY is the default for REJECT_LIMIT.
> So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
> COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is
> used.
>
>
> In line with my previous suggestion, if we support only REJECT_LIMIT
> without ON_ERROR, having INFINITY makes sense. However,
> with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
> Users can just set ON_ERROR=ignore to ignore all errors,
> making INFINITY unnecessary for REJECT_LIMIT. This is open for
> discussion, but I believe it's better to remove INFINITY from
> the REJECT_LIMIT options.

Agreed.
Unless there are opposing opinions, I'm going to remove 'INFINITY' in
the next patch.

> + else if (strcmp(defel->defname, "reject_limit") == 0)
> + {
> + if (reject_limit_specified)
> + errorConflictingDefElem(defel, pstate);
> + if (!opts_out->on_error)
> + ereport(ERROR,
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("REJECT_LIMIT
> requires ON_ERROR to be set to other than stop")));
>
> Using "ignore" instead of "other than stop" in the error message
> is clearer and more intuitive.

Agreed.

> Checking if ON_ERROR and REJECT_LIMIT are specified should be
> done after the foreach() processing of the options. Otherwise,
> if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can
> occur.
>
> ---------------
> =# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
> ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop
> ---------------

Ugh, I'll modify it.

> + ereport(ERROR,
> +
> (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> + errmsg("exceeded the
> number specified by REJECT_LIMIT \"%lld\"",
> + (long
> long) cstate->opts.reject_limits.num_err)));
>
> The error message isn't clear about what exceeded REJECT_LIMIT.
> How about saying "skipped more than REJECT_LIMIT rows" or something
> instead?

Agreed.

> +/*
> + * A struct to hold reject_limit options, in a parsed form.
> + * More values to be added in another patch.
> + */
>
> The latter comment doesn't seem necessary or helpful.

Agreed.

[1]
https://www.postgresql.org/message-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_%3DzuWjkz1%2B25Nd8bpsrDkx5Q%40mail.gmail.com

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message torikoshia 2024-07-23 14:10:24 Re: Add new COPY option REJECT_LIMIT
Previous Message David Christensen 2024-07-23 13:42:07 Re: [PATCH] GROUP BY ALL