GSOC'17 project introduction: Parallel COPY execution with errors handling

From: Alexey Kondratov <kondratov(dot)aleksey(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: GSOC'17 project introduction: Parallel COPY execution with errors handling
Date: 2017-03-23 11:33:53
Message-ID: 7179F2FD-49CE-4093-AE14-1B26C5DFB0DA@gmail.com
Lists: pgsql-hackers

Hi pgsql-hackers,

I'm planning to apply to GSOC'17, and my proposal currently consists of two parts:

(1) Add error handling to COPY as a minimum program

Motivation: Having used PG on a daily basis for years, I have found cases where you need to load (e.g. for further analytics) a bunch of not entirely consistent records with rare type/column mismatches. Since PG throws an exception on the first error, currently the only solution is to preformat your data with some other tool and then load it into PG. However, it is frequently easier to drop the few bad records than to do such preprocessing for every data source you have.
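To make the current workaround concrete, here is a minimal sketch of such client-side preprocessing (file names and the validation rule are hypothetical, picked just for illustration): it drops every CSV line whose column count does not match the header, so the cleaned file can then be loaded with a plain COPY.

    import csv

    # Hypothetical pre-filter: keep only the rows whose column count
    # matches the header, writing the survivors to a clean file.
    def filter_bad_lines(src_path, dst_path):
        with open(src_path, newline='') as src, \
             open(dst_path, 'w', newline='') as dst:
            reader = csv.reader(src)
            writer = csv.writer(dst)
            header = next(reader)
            writer.writerow(header)
            for row in reader:
                if len(row) == len(header):  # drop mismatched rows
                    writer.writerow(row)

    filter_bad_lines('raw_data.csv', 'clean_data.csv')
    # afterwards, e.g.:
    # COPY my_table FROM '/path/to/clean_data.csv' WITH (FORMAT csv, HEADER);

Built-in error handling in COPY would make this extra pass unnecessary.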

I have done some research and found the corresponding item in PG's TODO https://wiki.postgresql.org/wiki/Todo#COPY, as well as a previous attempt to push a similar patch https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a531@mail.gmail.com. There were no negative responses to this patch; it seems it was simply forgotten and never finalized.

As an example of the general idea, I can point to the read_csv method of the Python package pandas (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). It uses a C parser which throws an error on the first column mismatch. However, it has two flags, error_bad_lines and warn_bad_lines, which when set to False make it drop bad lines or even suppress the warning messages about them.
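For reference, this is how it looks on the pandas side, using the two flags named above (the file name is hypothetical):

    import pandas as pd

    # Drop malformed rows (e.g. rows with too many fields) instead of
    # raising on the first one, and suppress the per-line warnings too.
    df = pd.read_csv('raw_data.csv',
                     error_bad_lines=False,
                     warn_bad_lines=False)

Something similar at the COPY level, i.e. an option to skip (and optionally report) bad input rows, is what part (1) is about.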

(2) Parallel COPY execution as a maximum program

I guess there is not much to say about motivation: it should simply be faster on multicore CPUs.

There is also a record about parallel COPY in PG's wiki https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is always better to have well-performing core functionality out of the box.
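For illustration only, here is a crude client-side approximation of the idea (the connection string, table, and file names are all hypothetical): the input is pre-split into chunks and each worker loads its chunk over its own connection. The project itself would instead aim to parallelize a single COPY inside the server, without the manual splitting.

    import multiprocessing as mp
    import psycopg2

    DSN = 'dbname=test'  # hypothetical connection string
    CHUNKS = ['chunk_0.csv', 'chunk_1.csv', 'chunk_2.csv', 'chunk_3.csv']

    def copy_chunk(path):
        # Each worker streams its own pre-split file over a separate connection.
        conn = psycopg2.connect(DSN)
        with conn, conn.cursor() as cur, open(path) as f:
            cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", f)
        conn.close()

    if __name__ == '__main__':
        with mp.Pool(len(CHUNKS)) as pool:
            pool.map(copy_chunk, CHUNKS)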

My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
2) These two points share a general idea, namely to simplify working with large amounts of data from different sources, but maybe it would be better to focus on a single task?
3) Is it realistic to mostly finish both parts during the 3+ months of almost full-time work, or am I being too presumptuous?

I would greatly appreciate any comments and criticism.

P.S. I know about the very interesting ready-made project ideas from the PG community https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more interesting to solve your own problems, issues, and questions, which are the product of your experience with the software. That's why I dare to propose my own project.

P.P.S. A few words about me: I'm a PhD student in theoretical physics from Moscow, Russia, and have been heavily involved in software development since 2010. I believe I have good skills in Python, Ruby, JavaScript, MATLAB, C, and Fortran development, and a basic understanding of algorithm design and analysis.

Best regards,

Alexey
