Re: Parallel copy

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Ants Aasma <ants(at)cybertec(dot)at>, Alastair Turner <minion(at)decodable(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel copy
Date: 2020-02-26 10:54:01
Message-ID: CAA4eK1KtdjBJtyLjwFrqykBD0SQgK6JJHLmTYRooUJn08EcTCA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> >Hi,
> >
> >> The one piece of information I'm missing here is at least a very rough
> >> quantification of the individual steps of CSV processing - for example
> >> if parsing takes only 10% of the time, it's pretty pointless to start by
> >> parallelising this part and we should focus on the rest. If it's 50% it
> >> might be a different story. Has anyone done any measurements?
> >
> >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> >way more than 10%.
> >
>
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first.
>

Agreed.

> I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.
>

Right, I don't know what the best way to define that is. I can think
of the following tests.

1. A table with 10 columns (with integer, date, and text datatypes).
It has one index (unique/primary). Load 1 million rows (the data
should probably be 5-10 GB).
2. A table with 10 columns (with integer, date, and text datatypes).
It has three indexes, one of which can be unique/primary. Load 1
million rows (the data should probably be 5-10 GB).
3. A table with 10 columns (with integer, date, and text datatypes).
It has three indexes, one of which can be unique/primary. It has before
and after triggers. Load 1 million rows (the data should probably be
5-10 GB).
4. A table with 10 columns (with integer, date, and text datatypes).
It has five or six indexes, one of which can be unique/primary. Load
1 million rows (the data should probably be 5-10 GB).
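
As a rough sketch of how input for the tests above could be generated
(nothing here has been agreed on in this thread): the function name
generate_csv, the split into six integer, two date, and two text
columns, and the text lengths are all assumptions for illustration.
The first column is a running counter, so it can back the
unique/primary index.

```python
import csv
import datetime
import random


def generate_csv(path, rows, seed=0):
    """Write `rows` CSV lines with 10 columns: 6 integers, 2 dates, 2 texts.

    The column mix and value ranges are illustrative assumptions only.
    """
    rng = random.Random(seed)
    base = datetime.date(2020, 1, 1)
    alphabet = 'abcdefghij ,"'  # comma and quote force CSV quoting
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(rows):
            # Column 1 is unique; the other integers are random.
            ints = [i] + [rng.randint(0, 1_000_000) for _ in range(5)]
            dates = [
                (base + datetime.timedelta(days=rng.randint(0, 3650))).isoformat()
                for _ in range(2)
            ]
            texts = [
                "".join(rng.choices(alphabet, k=rng.randint(5, 40)))
                for _ in range(2)
            ]
            writer.writerow(ints + dates + texts)
```

Calling generate_csv("test.csv", 1_000_000) would produce the row count
mentioned above; the text lengths would have to be increased
considerably to reach the data sizes mentioned.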

In all of these tests, we can check how much time is spent reading
and parsing the CSV files vs. the rest of the execution.
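
To illustrate why the measured fraction matters (the 10% vs. 50%
question quoted above), here is a small sketch based on Amdahl's law.
It assumes that only reading/parsing is parallelised, that it scales
perfectly across workers, and that the rest of the execution stays
serial; the function name max_speedup is made up for this example.

```python
def max_speedup(parse_fraction, workers):
    """Upper bound on overall speedup per Amdahl's law.

    parse_fraction: share of total COPY time spent reading/parsing (0..1).
    workers: number of parallel workers applied to that share only.
    Assumes perfect scaling of the parallel part (an idealisation).
    """
    if not 0.0 <= parse_fraction <= 1.0:
        raise ValueError("parse_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / workers)


if __name__ == "__main__":
    for fraction in (0.10, 0.50):
        print(f"{fraction:.0%} parsing, 8 workers: "
              f"{max_speedup(fraction, 8):.2f}x")
    # 10% parsing, 8 workers: 1.10x
    # 50% parsing, 8 workers: 1.78x
```

So if parsing turns out to be around 10% of the time, even eight
workers give at most about a 1.1x gain overall, whereas at 50% the
bound is about 1.8x.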

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
