Re: [PATCH] Performance Improvement For Copy From Binary Files

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Performance Improvement For Copy From Binary Files
Date: 2020-07-01 09:33:06
Message-ID: CALj2ACUUpe+Z6e03cCb5jsTqbFWr=NsjA+Z+UwCj3MW_CYSCCw@mail.gmail.com
Lists: pgsql-hackers

Hi,

Added this to the commitfest in case it is useful:
https://commitfest.postgresql.org/28/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

On Mon, Jun 29, 2020 at 10:50 AM Bharath Rupireddy <
bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:

> Hi Hackers,
>
> In a COPY FROM binary file, each tuple/row carries the following
> information:
> 1. field count (number of columns)
> 2. for every field, field size (column data length)
> 3. field data of field size (actual column data)
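
The layout listed above can be sketched in C as a walk over one tuple held in memory. This is only an illustration, not code from the patch: the helper names (read_be16, read_be32, tuple_length) are hypothetical. It relies on the documented binary COPY format, where the field count is a 16-bit integer and each field size is a 32-bit integer, both in network byte order, and a size of -1 marks a NULL field with no data bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian (network order) 16-bit value. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t) ((p[0] << 8) | p[1]);
}

/* Read a big-endian (network order) 32-bit value. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
           ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/*
 * Hypothetical helper: walk one tuple stored in buf[0..len) and return the
 * number of bytes it occupies, or 0 if the buffer ends before the tuple does.
 * *nfields receives the field count (-1 is the end-of-file trailer).
 */
static size_t tuple_length(const unsigned char *buf, size_t len, int *nfields)
{
    size_t pos;
    int n;

    if (len < 2)
        return 0;
    n = (int16_t) read_be16(buf);       /* 1. field count */
    pos = 2;

    for (int i = 0; i < n; i++)
    {
        int32_t flen;

        if (len - pos < 4)
            return 0;
        flen = (int32_t) read_be32(buf + pos);  /* 2. field size */
        pos += 4;

        if (flen == -1)
            continue;                   /* NULL field: no data follows */
        if (flen < 0 || (size_t) flen > len - pos)
            return 0;
        pos += (size_t) flen;           /* 3. field data of that size */
    }

    *nfields = n;
    return pos;
}
```

Each of the three numbered items corresponds to one read, which is why a tuple with N non-NULL columns needs 1 + 2N separate reads when nothing is buffered.
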
>
> Currently, each of these items is read directly from the file with its
> own fread() call, and this is repeated for every tuple/row.
>
> One observation is that fread() accounts for roughly 20% of the total
> execution time of a copy from a binary file, and the number of fread()
> calls is very high.
>
> For instance, with a 5.3GB dataset of 10 million tuples with 10 columns:
>
> total exec time (s) | time in fread() (s) | fread() call count
> --------------------+---------------------+-------------------
>             101.193 |              21.33  |          210000005
>             101.345 |              21.436 |          210000005
>
> The time spent in fread() and the corresponding call count can grow
> further with a larger number of columns, for instance 1000.
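
The call count can be checked with a little arithmetic, sketched below in C. The function name is hypothetical, and the formula assumes no NULL fields (a NULL field has no data to read). For 10 million tuples with 10 columns it accounts for all but 5 of the 210000005 measured calls. Those remaining 5 are presumably spent on the file header and trailer, but that is an assumption, not something measured here.

```c
#include <stdint.h>

/*
 * fread() calls needed for the tuple data when every value is read
 * separately: one call for the field count, plus one call for the size and
 * one for the data of each column. Assumes no NULL fields.
 */
static uint64_t per_tuple_fread_calls(uint64_t ntuples, uint64_t ncols)
{
    return ntuples * (1 + 2 * ncols);
}
```

With 1000 columns the per-tuple factor grows from 21 to 2001, so the call count scales almost linearly with the column count.
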
>
> One solution to this problem is to read data from the binary file in
> RAW_BUF_SIZE (64KB) chunks instead of calling fread() for every value
> (thus possibly also avoiding a few disk I/Os). This is similar to the
> approach followed for csv/text files.
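
A minimal sketch of the chunked-read idea in plain C is below. It is not the attached patch: the ChunkReader type and its functions are hypothetical names chosen for illustration, and only the 64KB chunk size is taken from the description above. Small reads are served from an in-memory buffer, and fread() is called only when that buffer is exhausted.

```c
#include <stdio.h>
#include <string.h>

#define RAW_BUF_SIZE 65536      /* 64KB chunk size, as described above */

/* Hypothetical reader state; the names are illustrative only. */
typedef struct ChunkReader
{
    FILE          *fp;
    size_t         pos;         /* next unread byte in buf */
    size_t         len;         /* number of valid bytes in buf */
    unsigned long  nfread;      /* fread() calls issued so far */
    unsigned char  buf[RAW_BUF_SIZE];
} ChunkReader;

static void chunk_reader_init(ChunkReader *r, FILE *fp)
{
    r->fp = fp;
    r->pos = 0;
    r->len = 0;
    r->nfread = 0;
}

/*
 * Copy up to n bytes into dest, refilling the buffer from the file only
 * when it is empty. Returns the number of bytes copied, which is less than
 * n only at end of file or on a read error.
 */
static size_t chunk_read(ChunkReader *r, void *dest, size_t n)
{
    size_t copied = 0;

    while (copied < n)
    {
        size_t avail;
        size_t take;

        if (r->pos == r->len)
        {
            r->len = fread(r->buf, 1, RAW_BUF_SIZE, r->fp);
            r->pos = 0;
            r->nfread++;
            if (r->len == 0)
                break;          /* end of file or error */
        }

        avail = r->len - r->pos;
        take = (n - copied < avail) ? n - copied : avail;
        memcpy((unsigned char *) dest + copied, r->buf + r->pos, take);
        r->pos += take;
        copied += take;
    }
    return copied;
}
```

A caller would use chunk_read() wherever it previously called fread() for a field count, field size, or field data. The copy loop also handles values that straddle a chunk boundary or are larger than one chunk.
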
>
> Attached is a patch implementing the above solution for binary format files.
>
> Below is the improvement gained:
>
> total exec time (s) | time in fread() (s) | fread() call count
> --------------------+---------------------+-------------------
>              75.757 |               2.73  |             160884
>              75.351 |               2.742 |             160884
>
> *Execution is about 1.34X faster, fread() time is reduced by 87%, and the
> fread() call count is reduced by more than 99%.*
>
> I request the community to review this patch if the approach and
> improvement seem beneficial.
>
> Any suggestions to improve further are most welcome.
>
> Attached also is the config file used for testing the above use case.
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>
