From: Hannu Krosing <hannuk(at)google(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Nitin Motiani <nitinmotiani(at)google(dot)com>
Subject: Re: Adding pg_dump flag for parallel export to pipes
Date: 2025-04-07 19:48:20
Message-ID: CAMT0RQRPJUVsZbv6B=G6ccoZ+OM2vPOG85B6xDd+vKR6-0wt0g@mail.gmail.com
Lists: pgsql-hackers
Just to bring this out separately: does anybody have any idea why the
pipe commands leave the shell process defunct inside the tests?
Re: 003-pg_dump_basic_tests has a few basic validation tests for
correct flag combinations. We need to write more automated tests in
002_pg_dump.pl but have been running into some issues with environment
setup due to which certain pipe commands result in the shell process
becoming defunct. These same commands are working fine in manual
testing. We are still looking into this.
----
Hannu
On Mon, Apr 7, 2025 at 7:17 PM Nitin Motiani <nitinmotiani(at)google(dot)com>
wrote:
> Hi Hackers,
>
> We are proposing the ability to specify a pipe command to pg_dump via a
> new flag, and are attaching the patch set.
>
> Why: Currently it is quite simple to pipe the output of pg_dump in
> text format through a command-line pipeline and do any manipulations
> necessary. Following is an example:
>
> pg_dump <flags> <dbname> | lz4 | pv -L 10k | ssh remote.host
> "cat - > remote.dump.lz4"
>
> Here we first compress the stream using lz4 and then send it over ssh
> to a remote host to be saved as a file while rate-limiting the network
> usage to 10KB/s.
>
> Something like this is not possible for format=directory (-Fd) since
> all you can provide is the directory name to store the individual
> files. Note that this is not possible regardless of whether the
> parallel dump option ('--jobs' flag) is used.
>
> While the directory format supports compression via a flag, the rest
> of the operations in the above example are not possible, and a pipe
> command also gives more flexibility in the choice of compression
> algorithm.
>
> This patch set gives pg_dump the ability to pipe the data in
> directory mode via a new flag '--pipe-command' (in both parallel and
> non-parallel mode).
>
> We also add a similar option to pg_restore.
>
> The following are the major use cases of these changes:
> 1. Stream pg_dump output to cloud storage
> 2. SSH the data to a remote host (with or without throttling)
> 3. Custom compression options
>
>
> Usage Examples: Here is an example of what the pipe-command will look like.
>
> pg_dump -Fd mydb --pipe-command="cat > dumpdir/%f" (dumpdir
> should exist beforehand.)
>
> This is equivalent to
>
> pg_dump -Fd mydb --file=dumpdir
>
> (Please note that the flags '--file' and '--pipe-command' can't be
> used together.)
>
> For the more complex scenario mentioned above, the command will be
> (with a parallelism of 5):
>
> pg_dump -Fd mydb -j 5 --pipe-command="lz4 | pv -L 10k | ssh
> remote.host 'cat > dumpdir/%f'"
>
> Please note the use of %f in the above examples. As a user would
> almost always want to write the post-processing output to a file (or
> perhaps a cloud location), we provide a format specifier %f in the
> command. The implementation of pipe-command replaces these format
> specifiers with the corresponding file names. These file names are the
> same as they would be in the current usage of directory format with
> '--file' flag (<dump_id>.dat, toc.dat, blob_NNN.toc,
> blob_<blob_id>.dat).
>
> The usage of this flag with pg_restore is similar. Here is an
> example of restoring from a gzip-compressed dump directory.
>
> pg_restore -C -Fd -d postgres --pipe-command="cat
> dumpdir/%f.gz | gunzip"
>
> The new flag in pg_restore also works with the '-l' and '-L' options:
>
> pg_restore -C -Fd -d postgres --pipe-command="cat dumpdir/%f" -L
> db.list
>
>
> Implementation Details: Here are the major changes:
> 1. We reuse the variables that currently hold the file name to also
> hold the pipe command, and add a new bool fSpecIsPipe in
> _archiveHandle (with similar bools in pg_dump.c and pg_restore.c) to
> indicate that the spec is a pipe command.
> 2. When this bool is set to true, we use popen and pclose instead
> of fopen and fclose.
> 3. To enable the format specifier %f in the pipe-command, we make
> changes to the file name creation logic in a few places. Currently
> the file name (corresponding to a table or large object) is appended
> to the directory name provided by the '--file' flag. In the case of
> '--pipe-command', we use 'replace_percent_placeholders' to replace %f
> with the corresponding file name. This change is made for both table
> files and LO TOC files (a rough sketch of this open path follows the
> list).
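>
> As a rough illustration of points 2 and 3, here is a minimal sketch of
> the open path, assuming the fSpecIsPipe flag and fSpec holding the
> --pipe-command template as described above; open_data_stream() and its
> exact placement are hypothetical, while replace_percent_placeholders()
> is the existing helper from src/common/percentrepl.c:
>
>     /* Hypothetical helper: open either a pipe or a plain file. */
>     static FILE *
>     open_data_stream(ArchiveHandle *AH, const char *filename, bool for_write)
>     {
>         if (AH->fSpecIsPipe)
>         {
>             /* fSpec holds the --pipe-command template; expand %f */
>             char   *cmd = replace_percent_placeholders(AH->fSpec,
>                                                        "--pipe-command",
>                                                        "f", filename);
>             FILE   *fh = popen(cmd, for_write ? "w" : "r");
>
>             free(cmd);
>             return fh;      /* must be closed with pclose(), not fclose() */
>         }
>         else
>         {
>             /* fSpec holds the --file directory; append the file name */
>             char    path[MAXPGPATH];
>
>             snprintf(path, sizeof(path), "%s/%s", AH->fSpec, filename);
>             return fopen(path, for_write ? PG_BINARY_W : PG_BINARY_R);
>         }
>     }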
>
> With these core changes, the rest of the code continues working as-is.
>
> We are attaching 4 patches for this change:
>
> 001-pg_dump_pipe has the pg_dump pipe support code.
> 002-pg_restore_pipe has the pg_restore pipe support.
> 003-pg_dump_basic_tests has a few basic validation tests for
> correct flag combinations. We need to write more automated tests in
> 002_pg_dump.pl but have been running into some issues with environment
> setup due to which certain pipe commands result in the shell process
> becoming defunct. These same commands are working fine in manual
> testing. We are still looking into this.
> 004-pg_dump_documentation has the proposed documentation changes.
>
> We are working on the above test issues and cleanup of the patches.
>
> Open Questions: There are a couple of open questions in the
> implementation:
>
> 1. Currently the LO TOC file (blob_NNN.toc) is opened in append
> mode. This is not possible with popen for the pipe command. From
> reading the code, it seems to us that this file doesn't need to be
> opened in append mode. As '_StartLOs' is called once per archive
> entry in WriteDataChunksForTocEntry, followed by the dumper function
> and then '_EndLOs', it should be okay to change this to 'w' mode. But
> this code has been there since the start, so we haven't made that
> change yet. In the patch, we have changed it to 'w' for the
> pipe-command case only and added the ideas for potential solutions in
> the comments (a small stdio-level sketch follows this list).
> 2. We are also not yet sure how to handle the environment issues
> when adding new tests to 002_pg_dump.pl.
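>
> A minimal sketch of the mode handling described in point 1, assuming
> the fSpecIsPipe flag from this patch, with the surrounding types and
> helpers simplified to plain stdio for illustration; open_lo_toc(),
> lo_toc_cmd (an already expanded --pipe-command) and lo_toc_path are
> hypothetical names:
>
>     /* Simplified: how blob_NNN.toc could be opened in _StartLOs().
>      * It is opened once per WriteDataChunksForTocEntry cycle, which
>      * is why 'w' may be enough even for plain files. */
>     static FILE *
>     open_lo_toc(bool spec_is_pipe, const char *lo_toc_cmd,
>                 const char *lo_toc_path)
>     {
>         if (spec_is_pipe)
>         {
>             /* popen() has no append mode, so the pipe path must use "w" */
>             return popen(lo_toc_cmd, "w");
>         }
>
>         /* plain files keep the historical append mode for now */
>         return fopen(lo_toc_path, "a");
>     }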
>
> Please let us know what you think.
>
> Thanks & Regards,
> Nitin Motiani
> Google
>