Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

From: Alexey Kondratov <kondratov(dot)aleksey(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Anastasia Lubennikova <lubennikovaAV(at)gmail(dot)com>
Subject: Re: GSOC'17 project introduction: Parallel COPY execution with errors handling
Date: 2017-06-12 10:52:17
Message-ID: 09534CC6-D594-4070-97BC-5AA93F107477@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you for your comments Peter, there are some points that I did not think about before.

> On 9 Jun 2017, at 01:09, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
>> Adding a full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
>> to be a large separated task and is out of the current project scope, but
>> maybe there is
>> a relatively simple way to somehow perform internally tuples insert with
>> ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
>> I understand he is the major contributor of UPSERT in PostgreSQL. It would
>> be great
>> if he will answer this question.
>
> I think that there is a way of making COPY use "speculative
> insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
> no inference specification. Whether or not this is useful depends on a
> lot of things.
>

I am not going to start with "speculative insertion" right now, but it would be very
useful, if you give me a point, where to start. Maybe I will at least try to evaluate
the complexity of the problem.

> I think that you need to more formally identify what errors your new
> COPY error handling will need to swallow.
> ...
> My advice right now is: see if you can figure out a way of doing what
> you want without subtransactions at all, possibly by cutting some
> scope. For example, maybe it would be satisfactory to have the
> implementation just ignore constraint violations, but still raise
> errors for invalid input for types.

Initially I was thinking only about malformed rows, e.g. less or extra columns.
Honestly, I did not know that there are so many levels and ways where error
can occur. So currently (and especially after your comments) I prefer to focus
only on the following list of errors:

1) File format issues
a. Less columns than needed
b. Extra columns

2) I am doubt about type mismatch. It is possible to imagine a situation when,
e.g. some integers are exported as int, and some as "int", but I am not sure
that is is a common situation.

3) Some constraint violations, e.g. unique index.

First appeared to be easy achievable without subtransactions. I have created a
proof of concept version of copy, where the errors handling is turned on by default.
Please, see small patch attached (applicable to 76b11e8a43eca4612dfccfe7f3ebd293fb8a46ec)
or GUI version on GitHub https://github.com/ololobus/postgres/pull/1/files <https://github.com/ololobus/postgres/pull/1/files>.
It throws warnings instead of errors for malformed lines with less/extra columns
and reports line number.

Second is probably achievable without subtransactions via the PG_TRY/PG_CATCH
around heap_form_tuple, since it is not yet inserted into the heap.

But third is questionable without subtransactions, since even if we check
constraints once, there maybe various before/after triggers which can modify
tuple, so it will not satisfy them. Corresponding comment inside copy.c states:
"Note that a BR trigger might modify tuple such that the partition constraint is
no satisfied, so we need to check in that case." Thus, there are maybe different
situations here, as I understand. However, it a point where "speculative insertion"
is able to help.

These three cases should cover most real-life scenarios.

> Is there really much value in ignoring errors due to invalid encoding?

Now, I have some doubts about it too. If there is an encoding problem,
it is probably about the whole file, not only a few rows.

Alexey

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2017-06-12 11:12:08 Re: Statement-level rollback
Previous Message amul sul 2017-06-12 10:20:49 Re: remove unnecessary flag has_null from PartitionBoundInfoData