Re: Parallel copy

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Ants Aasma <ants(at)cybertec(dot)at>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-03-03 05:55:13
Message-ID: CALDaNm3NNEmFTzJM1hiOsKrxrRecWp=sLzcG4fqHZwf+WDP1_A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 26, 2020 at 4:24 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >
> > On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> > >Hi,
> > >
> > >> The one piece of information I'm missing here is at least a very rough
> > >> quantification of the individual steps of CSV processing - for example
> > >> if parsing takes only 10% of the time, it's pretty pointless to start by
> > >> parallelising this part and we should focus on the rest. If it's 50% it
> > >> might be a different story. Has anyone done any measurements?
> > >
> > >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> > >way more than 10%.
> > >
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that. I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after triggers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>

I have tried to capture the execution time for 3 scenarios, which I felt
could give a fair idea:
Test1 (Table with 3 indexes and 1 trigger)
Test2 (Table with 2 indexes)
Test3 (Table without indexes/triggers)

I have captured the following details:
File Read Time - time taken to read the file, measured in the CopyGetData function.
Read Line Time - time taken to read a line, measured in the NextCopyFrom function (read time & tokenize time excluded).
Tokenize Time - time taken to tokenize the contents, measured in the NextCopyFromRawFields function.
Data Execution Time - the remaining execution time out of the total time.
The execution breakdown for the tests is given below (times in seconds):
Test     Total Time   File Read Time   Read Line/Buffer Read Time   Tokenize Time   Data Execution Time
Test1    1693.369     0.256            34.173                       5.578           1653.362
Test2    736.096      0.288            39.762                       6.525           689.521
Test3    112.06       0.266            39.189                       6.433           66.172
Steps for the scenarios:
Test1 (Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int, age int, ethnic int, sex int, area text, count text);
CREATE TABLE census3 (year int, age int, ethnic int, sex int, area text, count text);

CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);

CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
INSERT INTO census3 SELECT * FROM census2 limit 1;
RETURN NEW;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER census2_trigger AFTER INSERT ON census2 FOR EACH ROW
EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int, age int, ethnic int, sex int, area text, count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test3 (Table without indexes/triggers):
CREATE TABLE census (year int, age int, ethnic int, sex int, area text, count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Note: The Data8277.csv used was the same data that Ants Aasma had used.
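
For anyone who wants to reproduce the numbers: the Total Time column is
roughly what psql's \timing reports around the COPY, while the per-phase
breakdown is not exposed at the SQL level and comes from timing the
functions mentioned above. For example:

\timing on
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);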

From the above results we can infer that the read line phase will have to be
done sequentially, since the line/record boundaries can only be found by
scanning the data in order. Read line time takes about 2.01%, 5.40% and
34.97% of the total time in the three tests respectively. I feel we will be
able to parallelise the remaining phases of the copy. The performance
improvement will vary based on the scenario (indexes/triggers); it will be
proportionate to the number of indexes and triggers. Read line could also be
parallelised in text format (non-csv). I feel parallelising copy could give
a significant improvement in quite a few scenarios.
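
To make the split concrete, a user-facing option could look something like
the below. This is purely illustrative: the PARALLEL option is my assumption
here, not actual or proposed syntax. The idea would be that the leader does
the sequential read line phase and hands complete lines to the workers,
which do the tokenizing and the data execution (insert/index/trigger) work:

COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true, PARALLEL 4);
-- PARALLEL is a hypothetical option, shown only to illustrate the idea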

Further, I'm planning to see how the execution breaks down for a table with
toasted data. I'm also planning to run the tests on a RAM disk, where I will
place the data on the RAM disk so that we can further eliminate the I/O cost.
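
For the RAM disk test, a sketch of what I have in mind (the mount point and
names here are just examples) is to put both the table and the input file on
a tmpfs mount and rerun the same COPY:

-- assuming a tmpfs mount at /mnt/ramdisk that is writable by the postgres user
CREATE TABLESPACE ramdisk LOCATION '/mnt/ramdisk';
CREATE TABLE census_ram (year int, age int, ethnic int, sex int, area text, count text) TABLESPACE ramdisk;
COPY census_ram FROM '/mnt/ramdisk/Data8277.csv' WITH (FORMAT 'csv', HEADER true);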

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
