Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

From: jian he <jian(dot)universality(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>, Damir Belyalov <dam(dot)bel07(at)gmail(dot)com>, zhihuifan1213(at)163(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Daniel Gustafsson <daniel(at)yesql(dot)se>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, anisimow(dot)d(at)gmail(dot)com, HukuToc(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date: 2023-12-18 00:15:00
Message-ID: CACJufxHi53OpGYPAe6SdCb4m=-+H8L+7LDbUWvTiJp=V4YYEqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Dec 15, 2023 at 4:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Hi,
>
> I've read this thread and the latest patch. IIUC with SAVE_ERROR
> option, COPY FROM creates an error table for the target table and
> writes error information there.
>
> While I agree that the final shape of this feature would be something
> like that design, I'm concerned some features are missing in order to
> make this feature useful in practice. For instance, error logs are
> inserted to error tables without bounds, meaning that users who want
> to tolerate errors during COPY FROM will have to truncate or drop the
> error tables periodically, or the database will grow with error logs
> without limit. Ideally such maintenance work should be done by the
> database. There might be some users who want to log such conversion
> errors in server logs to avoid such maintenance work. I think we
> should provide an option for where to write, at least. Also, since the
> error tables are normal user tables internally, error logs are also
> replicated to subscribers if there is a publication FOR ALL TABLES,
> unlike system catalogs. I think some users would not like such
> behavior.

save the error metadata to system catalogs would be more expensive,
please see below explanation.
I have no knowledge of publications.
but i feel there is a feature request: publication FOR ALL TABLES
exclude regex_pattern.
Anyway, that would be another topic.

> Looking at SAVE_ERROR feature closely, I think it consists of two
> separate features. That is, it enables COPY FROM to load data while
> (1) tolerating errors and (2) logging errors to somewhere (i.e., an
> error table). If we implement only (1), it would be like COPY FROM
> tolerate errors infinitely and log errors to /dev/null. The user
> cannot see the error details but I guess it could still help some
> cases as Andres mentioned[1] (it might be a good idea to send the
> number of rows successfully loaded in a NOTICE message if some rows
> could not be loaded). Then with (2), COPY FROM can log error
> information to somewhere such as tables and server logs and the user
> can select it. So I'm thinking we may be able to implement this
> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would
> be to support logging destinations such as server logs and tables.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/20231109002600.fuihn34bjqqgmbjm%40awork3.anarazel.de
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

> feature incrementally. The first step would be something like an
> option to ignore all errors or an option to specify the maximum number
> of errors to tolerate before raising an ERROR. The second step would

I don't think "specify the maximum number of errors to tolerate
before raising an ERROR." is very useful....

QUOTE from [1]
MAXERROR [AS] error_count
If the load returns the error_count number of errors or greater, the
load fails. If the load returns fewer errors, it continues and returns
an INFO message that states the number of rows that could not be
loaded. Use this parameter to allow loads to continue when certain
rows fail to load into the table because of formatting errors or other
inconsistencies in the data.
Set this value to 0 or 1 if you want the load to fail as soon as the
first error occurs. The AS keyword is optional. The MAXERROR default
value is 0 and the limit is 100000.
The actual number of errors reported might be greater than the
specified MAXERROR because of the parallel nature of Amazon Redshift.
If any node in the Amazon Redshift cluster detects that MAXERROR has
been exceeded, each node reports all of the errors it has encountered.
END OF QUOTE

option MAXERROR error_count. iiuc, it fails while validating line
error_count + 1, else it raises a notice, tells you how many rows have
errors.

* case when error_count is small, and the copy fails, it only tells
you that at least the error_count line has malformed data. but what if
the actual malformed rows are very big. In this case, this failure
error message is not that helpful.
* case when error_count is very big, and the copy does not fail. then
the actual malformed data rows are very big (still less than
error_count). but there is no error report, you don't know which line
has an error.

Either way, if the file has a large portion of malformed rows, then
the MAXERROR option does not make sense.
so maybe we don't need a threshold for tolerating errors.

however, we can have an option, not actually copy to the table, but
only validate, similar to NOLOAD in [1]

why we save the error:
* if only a small portion of malformed rows then saving the error
metadata would be cheap.
* if a large portion of malformed rows then copy will be slow but we
saved the error metadata. Now you can fix it based on this error
metadata.

I think saving errors to a regular table or text file seems sane, but
not to a catalog table.
* for a text file with M rows, N fields, contrived corner case would
be (M-2) * N errors, the last 2 rows have the duplicate keys, violate
primary key constraint. In this case, we first insert (M-2) * N rows
to the catalog table then because of errors we undo it.
I think it will be expensive.
* error meta info is not as important as other pg_catalog tables.

log format is quite verbose, save_error to log seems not so good, I guess.

I suppose we can specify an ERRORFILE directory. similar
implementation [2], demo in [3]
it will generate 2 files, one file shows the malform line content as
is, another file shows the error info.

Let's assume we save the error info to a table:
Since the previous thread says one copy operation may create one error
table is not a good idea, looking back, I agree.
Similar to [4]
I come with the following logic/ideas:
* save_error table name be COPY_ERRORS, shema be the same as copy from
destination table.
* one COPY_ERRORS table saves all COPY FROM generated error metadata
* if save_error specified, before do COPY FROM, first check if the
table COPY_ERRORS
exists,
if not then create one? Or raise an error saying that COPY_ERRORS does
not exist, cannot save_error?
* COPY_ERRORS table owner be current database owner?
* Only the table owner is allowed to INSERT/DELETE/UPDATE, others are
not allowed to INSERT/DELETE/UPDATE.
while doing copy error happened, record the userid, then switch
COPY_ERRORS owner execute the insert command
* the user who is doing COPY FROM operation is allowed solely to view
(select) the errored row they generated.

COPY_ERRORS table would be:
userid oid /* the user who is doing this operation */
error_time timestamptz /* when this error
happened. not 100% sure this column is needed */
filename text /* the copy from source */
table_name text /* the copy from destination */
lineno bigint /* the error line number */
line text /* the whole line raw content */
colname text -- Field with the error.
raw_field_value text --- The value for the field that leads to the error.
err_message text -- same as ErrorData->message
err_detail text --same as ErrorData->detail
errorcode text --transformed errcode, example "22P02"

[1] https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[2] https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver16
[3] https://www.sqlshack.com/working-with-line-numbers-and-errors-using-bulk-insert/
[4] https://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2023-12-18 00:19:00 Re: [PATCH]: Not to invaldiate CatalogSnapshot for local invalidation messages
Previous Message Matthias van de Meent 2023-12-17 23:53:34 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements