From: | Hannu Krosing <hannuk(at)google(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Cc: | Nitin Motiani <nitinmotiani(at)google(dot)com> |
Subject: | Re: Adding pg_dump flag for parallel export to pipes |
Date: | 2025-04-22 12:10:58 |
Message-ID: | CAMT0RQQ1U3gLfeBhmk0Lef-1=wm707Ce9HuugMx_X5AXoG0Rbg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
If there are no objections we will add this to the commitfest
On Mon, Apr 7, 2025 at 9:48 PM Hannu Krosing <hannuk(at)google(dot)com> wrote:
>
>
> Just to bring this out separately : Does anybody have any idea why pipe commands close inside tests ?
>
> Re: 003-pg_dump_basic_tests has a few basic validation tests for
> correctmflag combinations. We need to write more automated tests in
> 002_pg_dump.pl but have been running into some issues with environment
> setup due to which certain pipe commands result in the shell process
> becoming defunct. These same commands are working fine in manual
> testing. We are still looking into this.
>
> ----
> Hannu
>
>
> On Mon, Apr 7, 2025 at 7:17 PM Nitin Motiani <nitinmotiani(at)google(dot)com> wrote:
>>
>> Hi Hackers,
>>
>> We are proposing the ability to specify a pipe command to pg_dump by a
>> flag. And attaching the patch set.
>>
>> Why : Currently it is quite simple to pipe the output of pg_dump for
>> text format to a pipe at command line and do any manipulations
>> necessary. Following is an example :
>>
>> pg_dump <flags> <dbname> | lz4 | pv -L 10k | ssh remote.host
>> "cat - > remote.dump.lz4"
>>
>> Here we first compress the stream using lz4 and then send it over ssh
>> to a remote host to be saved as a file while rate-limiting the network
>> usage to 10KB/s.
>>
>> Something like this is not possible for format=directory (-Fd) since
>> all you can provide is the directory name to store the individual
>> files. Note it is not possible to do this irrespective of the usage of
>> the parallel dump option ('--jobs' flag).
>>
>> While the directory format supports compression using a flag, the rest
>> of the operations in the above example are not possible. And a pipe
>> command provides more flexibility in what compression algorithm one
>> wants to use.
>>
>> This patch set provides pg_dump the ability to pipe the data in the
>> directory mode by using a new flag '--pipe-command' (in both parallel
>> and non-parallel mode).
>>
>> We also add a similar option to pg_restore.
>>
>> The following can be the major use cases of these changes :
>> 1. Stream pg_dump output to a cloud storage
>> 2. SSH the data to a remote host (with or without throttling)
>> 3. Custom compression options
>>
>>
>> Usage Examples : Here is an example of how the pipe-command will look like.
>>
>> pg_dump -Fd mydb --pipe-command="cat > dumpdir/%f" (dumpdir
>> should exist beforehand.)
>>
>> This is equivalent to
>>
>> pg_dump -Fd mydb --file=dumpdir
>>
>> (Please note that the flags '--file' or '--pipe-command' can't be used
>> together.)
>>
>> For the more complex scenario as mentioned above, the command will be
>> (with the parallelism of 5) :
>>
>> pg_dump -Fd mydb -j 5 --pipe-command="lz4 | pv -L 10k | ssh
>> remote.host "cat > dumpdir/%f""
>>
>> Please note the use of %f in the above examples. As a user would
>> almost always want to write the post-processing output to a file (or
>> perhaps a cloud location), we provide a format specifier %f in the
>> command. The implementation of pipe-command replaces these format
>> specifiers with the corresponding file names. These file names are the
>> same as they would be in the current usage of directory format with
>> '--file' flag (<dump_id>.dat, toc.dat, blob_NNN.toc,
>> blob_<blob_id>.dat).
>>
>> The usage of this flag with pg_restore will also be similar. Here is
>> an example of restoring from a gzip compressed dump directory.
>>
>> pg_restore -C -Fd -d postgres --pipe-commnad="cat
>> dumpdir/%f.gz | gunzip"
>>
>> The new flag in pg_restore also works with '-l' and '-L' options
>>
>> pg_restore -C -Fd -d postgres --pipe-commnad="cat dumpdir/%f" -L db.list
>>
>>
>> Implementation Details : Here are the major changes :
>> 1. We reuse the same variables which store the file name to store
>> the pipe command. And add a new bool fSpecIsPipe in _archiveHandle
>> (similar bools in pg_dump.c and pg_restore.c) to specify if it's a
>> pipe command.
>> 2. In the cases when the above bool is set to true, we use popen
>> and pclose instead of fopen and fclose.
>> 3. To enable the format specifier %f in the pipe-command, we make
>> changes to the file name creation logic in a few places. Currently the
>> file name (corresponding to a table or large object) is appended to
>> the directory name provided by '--file' command. In case of
>> '--pipe-command', we use 'replace_percent_placeholders' to replace %f
>> with the corresponding file name. This change is made for both table
>> files and LO TOC files.
>>
>> With these core changes, the rest of the code continues working as-is.
>>
>> We are attaching 4 patches for this change :
>>
>> 001-pg_dump_pipe has the pg_dump pipe support code.
>> 002-pg_restore_pipe has the pg_restore pipe support.
>> 003-pg_dump_basic_tests has a few basic validation tests for
>> correctmflag combinations. We need to write more automated tests in
>> 002_pg_dump.pl but have been running into some issues with environment
>> setup due to which certain pipe commands result in the shell process
>> becoming defunct. These same commands are working fine in manual
>> testing. We are still looking into this.
>> 004-pg_dump_documentation has the proposed documentation changes.
>>
>> We are working on the above test issues and cleanup of the patches.
>>
>> Open Questions : There are a couple of open questions in the implementation :
>>
>> 1. Currently the LO TOC file (blob_NNN.toc) is opened in the
>> append mode. This is not possible with popen for the pipe command.
>> From reading the code, it seems to us that this file doesn't need to
>> be opened in the append mode. As '_StartLOs' is called once per
>> archive entry in WriteDataChunksForToCEntry followed by the dumper
>> function and then '_EndLOs', it should be okay to change this to 'w'
>> mode. But this code has been there since the start so we haven't made
>> that change yet. In the patch, we have changed it to 'w' pipe-command
>> only and added the ideas for potential solutions in the comments.
>> 2. We are also not sure yet on how to handle the environment
>> issues when trying to add new tests to 002_pg_dump.pl.
>>
>> Please let us know what you think.
>>
>> Thanks & Regards,
>> Nitin Motiani
>> Google
From | Date | Subject | |
---|---|---|---|
Next Message | Anthonin Bonnefoy | 2025-04-22 12:37:19 | Re: Add Pipelining support in psql |
Previous Message | Aleksander Alekseev | 2025-04-22 12:10:51 | Re: Cygwin support |