From: Hannu Krosing <hannuk(at)google(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Nitin Motiani <nitinmotiani(at)google(dot)com>
Subject: Re: Adding pg_dump flag for parallel export to pipes
Date: 2025-04-07 19:48:20
Message-ID: CAMT0RQRPJUVsZbv6B=G6ccoZ+OM2vPOG85B6xDd+vKR6-0wt0g@mail.gmail.com
Lists: pgsql-hackers
Just to bring this out separately: does anybody have any idea why the
pipe commands leave the shell process defunct inside the tests?
Re: 003-pg_dump_basic_tests has a few basic validation tests for
correct flag combinations. We need to write more automated tests in
002_pg_dump.pl but have been running into some issues with environment
setup due to which certain pipe commands result in the shell process
becoming defunct. These same commands are working fine in manual
testing. We are still looking into this.
----
Hannu
On Mon, Apr 7, 2025 at 7:17 PM Nitin Motiani <nitinmotiani(at)google(dot)com>
wrote:
> Hi Hackers,
>
> We are proposing the ability to specify a pipe command to pg_dump via a
> new flag, and are attaching the patch set.
>
> Why: Currently it is quite simple to pipe the output of pg_dump in
> text format through a command-line pipeline and do any manipulations
> necessary. Following is an example:
>
> pg_dump <flags> <dbname> | lz4 | pv -L 10k | ssh remote.host
> "cat - > remote.dump.lz4"
>
> Here we first compress the stream using lz4 and then send it over ssh
> to a remote host to be saved as a file while rate-limiting the network
> usage to 10KB/s.
>
> Something like this is not possible for format=directory (-Fd) since
> all you can provide is the directory name to store the individual
> files. Note that this is not possible regardless of whether the
> parallel dump option ('--jobs' flag) is used.
>
> While the directory format supports compression via a flag, the rest
> of the operations in the above example are not possible, and a pipe
> command also gives more flexibility in the choice of compression
> algorithm.
>
> This patch set gives pg_dump the ability to pipe the data in
> directory mode via a new flag '--pipe-command' (in both parallel and
> non-parallel mode).
>
> We also add a similar option to pg_restore.
>
> The following are the major use cases of these changes:
> 1. Stream pg_dump output to cloud storage
> 2. SSH the data to a remote host (with or without throttling)
> 3. Custom compression options
>
>
> Usage Examples: Here is an example of what the pipe-command will look like.
>
> pg_dump -Fd mydb --pipe-command="cat > dumpdir/%f" (dumpdir
> should exist beforehand.)
>
> This is equivalent to
>
> pg_dump -Fd mydb --file=dumpdir
>
> (Please note that the flags '--file' and '--pipe-command' can't be
> used together.)
>
> For the more complex scenario mentioned above, the command will be
> (with a parallelism of 5):
>
> pg_dump -Fd mydb -j 5 --pipe-command="lz4 | pv -L 10k | ssh
> remote.host 'cat > dumpdir/%f'"
>
> Please note the use of %f in the above examples. As a user would
> almost always want to write the post-processing output to a file (or
> perhaps a cloud location), we provide a format specifier %f in the
> command. The implementation of pipe-command replaces these format
> specifiers with the corresponding file names. These file names are the
> same as they would be in the current usage of directory format with
> '--file' flag (<dump_id>.dat, toc.dat, blob_NNN.toc,
> blob_<blob_id>.dat).
>
> The usage of this flag with pg_restore is similar. Here is an
> example of restoring from a gzip-compressed dump directory.
>
> pg_restore -C -Fd -d postgres --pipe-command="cat
> dumpdir/%f.gz | gunzip"
>
> The new flag in pg_restore also works with the '-l' and '-L' options:
>
> pg_restore -C -Fd -d postgres --pipe-command="cat dumpdir/%f" -L
> db.list
>
>
> Implementation Details: Here are the major changes:
> 1. We reuse the variables that currently hold the file name to also
> hold the pipe command, and add a new bool fSpecIsPipe in
> _archiveHandle (with similar bools in pg_dump.c and pg_restore.c) to
> indicate that the spec is a pipe command.
> 2. When this bool is set to true, we use popen and pclose instead
> of fopen and fclose.
> 3. To enable the format specifier %f in the pipe-command, we make
> changes to the file name creation logic in a few places. Currently
> the file name (corresponding to a table or large object) is appended
> to the directory name provided by the '--file' flag. In the case of
> '--pipe-command', we use 'replace_percent_placeholders' to replace %f
> with the corresponding file name. This change is made for both table
> files and LO TOC files (a rough sketch of this open path follows the
> list).
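>
> As a rough illustration of points 2 and 3, here is a minimal sketch of
> the open path, assuming the fSpecIsPipe flag and fSpec holding the
> --pipe-command template as described above; open_data_stream() and its
> exact placement are hypothetical, while replace_percent_placeholders()
> is the existing helper from src/common/percentrepl.c:
>
>     /* Hypothetical helper: open either a pipe or a plain file. */
>     static FILE *
>     open_data_stream(ArchiveHandle *AH, const char *filename, bool for_write)
>     {
>         if (AH->fSpecIsPipe)
>         {
>             /* fSpec holds the --pipe-command template; expand %f */
>             char   *cmd = replace_percent_placeholders(AH->fSpec,
>                                                        "--pipe-command",
>                                                        "f", filename);
>             FILE   *fh = popen(cmd, for_write ? "w" : "r");
>
>             free(cmd);
>             return fh;      /* must be closed with pclose(), not fclose() */
>         }
>         else
>         {
>             /* fSpec holds the --file directory; append the file name */
>             char    path[MAXPGPATH];
>
>             snprintf(path, sizeof(path), "%s/%s", AH->fSpec, filename);
>             return fopen(path, for_write ? PG_BINARY_W : PG_BINARY_R);
>         }
>     }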
>
> With these core changes, the rest of the code continues working as-is.
>
> We are attaching 4 patches for this change:
>
> 001-pg_dump_pipe has the pg_dump pipe support code.
> 002-pg_restore_pipe has the pg_restore pipe support.
> 003-pg_dump_basic_tests has a few basic validation tests for
> correct flag combinations. We need to write more automated tests in
> 002_pg_dump.pl but have been running into some issues with environment
> setup due to which certain pipe commands result in the shell process
> becoming defunct. These same commands are working fine in manual
> testing. We are still looking into this.
> 004-pg_dump_documentation has the proposed documentation changes.
>
> We are working on the above test issues and cleanup of the patches.
>
> Open Questions: There are a couple of open questions in the
> implementation:
>
> 1. Currently the LO TOC file (blob_NNN.toc) is opened in append
> mode. This is not possible with popen for the pipe command. From
> reading the code, it seems to us that this file doesn't need to be
> opened in append mode. As '_StartLOs' is called once per archive
> entry in WriteDataChunksForTocEntry, followed by the dumper function
> and then '_EndLOs', it should be okay to change this to 'w' mode. But
> this code has been there since the start, so we haven't made that
> change yet. In the patch, we have changed it to 'w' for the
> pipe-command case only and added the ideas for potential solutions in
> the comments (a small stdio-level sketch follows this list).
> 2. We are also not yet sure how to handle the environment issues
> when adding new tests to 002_pg_dump.pl.
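>
> A minimal sketch of the mode handling described in point 1, assuming
> the fSpecIsPipe flag from this patch, with the surrounding types and
> helpers simplified to plain stdio for illustration; open_lo_toc(),
> lo_toc_cmd (an already expanded --pipe-command) and lo_toc_path are
> hypothetical names:
>
>     /* Simplified: how blob_NNN.toc could be opened in _StartLOs().
>      * It is opened once per WriteDataChunksForTocEntry cycle, which
>      * is why 'w' may be enough even for plain files. */
>     static FILE *
>     open_lo_toc(bool spec_is_pipe, const char *lo_toc_cmd,
>                 const char *lo_toc_path)
>     {
>         if (spec_is_pipe)
>         {
>             /* popen() has no append mode, so the pipe path must use "w" */
>             return popen(lo_toc_cmd, "w");
>         }
>
>         /* plain files keep the historical append mode for now */
>         return fopen(lo_toc_path, "a");
>     }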
>
> Please let us know what you think.
>
> Thanks & Regards,
> Nitin Motiani
> Google
>