Re: WIP/PoC for parallel backup

From: Ahsan Hadi <ahsan(dot)hadi(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Asif Rehman <asifr(dot)rehman(at)gmail(dot)com>, Asim R P <apraveen(at)pivotal(dot)io>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP/PoC for parallel backup
Date: 2019-08-23 19:15:32
Message-ID: CA+9bhCKucHLemEednfBD1Dz3UvN8Sh6=5oCV=4MfKxiteKX_bQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, 23 Aug 2019 at 10:26 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Greetings,
>
> * Asif Rehman (asifr(dot)rehman(at)gmail(dot)com) wrote:
> > On Fri, Aug 23, 2019 at 3:18 PM Asim R P <apraveen(at)pivotal(dot)io> wrote:
> > > Interesting proposal. Bulk of the work in a backup is
> > > transferring files from the source data directory to the
> > > destination. Your patch is breaking this task down into multiple
> > > sets of files and transferring each set in parallel. This seems
> > > correct; however, your patch is also creating a new process to
> > > handle each set. Is that necessary? I think we should try to
> > > achieve this using multiple asynchronous libpq connections from a
> > > single basebackup process, i.e. use the PQconnectStartParams()
> > > interface instead of PQconnectdbParams(), which is currently used
> > > by basebackup. On the server side, it may still result in
> > > multiple backend processes per connection, and an attempt should
> > > be made to avoid that as well, but it seems complicated.
> >
> > Thanks Asim for the feedback. This is a good suggestion. The main
> > idea I wanted to discuss is a design where we can open multiple
> > backend connections to get the data instead of a single connection.
> > On the client side we can take multiple approaches: one is to use
> > asynchronous APIs (as suggested by you), and another is to decide
> > between multi-process and multi-threaded. The main point is that we
> > can extract a lot of performance benefit by using multiple
> > connections, and I built this POC to float the idea of how parallel
> > backup can work, since the core logic of getting the files over
> > multiple connections remains the same whether we use an
> > asynchronous, multi-process, or multi-threaded approach.
> >
> > I am going to address the division of files so that they are
> > distributed evenly among multiple workers based on file sizes. That
> > would give us some concrete numbers, and it would also let us gauge
> > the benefits of the async versus multi-process/thread approaches on
> > the client side.
>
> I would expect you to quickly want to support compression on the server
> side, before the data is sent across the network, and possibly
> encryption, and so it'd likely make sense to just have independent
> processes and connections through which to do that.
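
The size-based division of files among workers discussed earlier in the thread could, hypothetically, be sketched as a greedy assignment that always hands the next-largest file to the currently least-loaded worker. This is only an illustration of the balancing idea, not the patch's actual logic; `distribute_files` and the sample file names are made up for the example:

```python
import heapq

def distribute_files(files, n_workers):
    """Greedy LPT-style assignment: largest files first, each given to
    the currently least-loaded worker. `files` is a list of (name, size)."""
    # Heap entries: (assigned_bytes, worker_id, file_names); the unique
    # worker_id breaks ties so the lists are never compared.
    workers = [(0, i, []) for i in range(n_workers)]
    heapq.heapify(workers)
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, wid, names = heapq.heappop(workers)
        names.append(name)
        heapq.heappush(workers, (load + size, wid, names))
    return sorted(workers, key=lambda w: w[1])  # order by worker id

# Example: five files split across two "connections"; the resulting
# loads are 150 and 140 bytes, i.e. reasonably even.
sets = distribute_files(
    [("16384", 100), ("16385", 90), ("16386", 50),
     ("16387", 40), ("pg_control", 10)],
    2,
)
```

A simple greedy pass like this keeps the per-worker byte counts close without needing an exact (NP-hard) partition.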

It would be interesting to see the benefits of compression (before the data
is transferred over the network) on top of parallelism, since there is also
some overhead associated with performing the compression. I agree with your
suggestion to add parallelism first and then try compression before the
data is sent across the network.
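
As a rough, hypothetical illustration of the client-side pattern under discussion (a thread pool standing in for multiple libpq connections, and plain file copies standing in for the actual streamed transfer; `parallel_backup` and `copy_file_set` are invented names for the sketch):

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def copy_file_set(file_set, src_dir, dst_dir):
    """One worker: transfer its assigned set of files. In the design
    being discussed, this slot would be one server connection
    streaming its file set."""
    for name in file_set:
        shutil.copy2(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return len(file_set)

def parallel_backup(file_sets, src_dir, dst_dir):
    # One worker per file set, mirroring one connection per set; the
    # sets would come from the size-based division step.
    with ThreadPoolExecutor(max_workers=len(file_sets)) as pool:
        done = pool.map(lambda s: copy_file_set(s, src_dir, dst_dir), file_sets)
        return sum(done)
```

The same shape works whether the workers are threads, processes, or async state machines driven from one process; only the body of `copy_file_set` changes.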

>
> Thanks,
>
> Stephen
>
