Re: refactoring basebackup.c

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: refactoring basebackup.c
Date: 2020-07-31 16:49:39
Message-ID: 20200731164939.zu2hvv5ooto275kp@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-07-29 11:31:26 -0400, Robert Haas wrote:
> Here's an updated patch set. This is now rebased over master and
> includes as 0001 the patch I posted separately at
> http://postgr.es/m/CA+TgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx=HYvBinefNH8Q@mail.gmail.com
> but drops some other patches that were committed meanwhile. 0002-0009
> of this series are basically the same as 0004-0011 from the previous
> series, except for rebasing and fixing a bug I discovered in what's
> now 0006. 0012 does a refactoring of pg_basebackup along similar lines
> to the server-side refactoring from patches earlier in the series.

Have you tested whether this still works against older servers? Or do
you think we should not have that as a goal?

> 1. pg_basebackup -R injects recovery.conf (on older versions) or
> injects standby.signal and appends to postgresql.auto.conf (on newer
> versions) by parsing the tar file sent by the server and editing it on
> the fly. From the point of view of server-side compression, this is
> not ideal, because if you want to make these kinds of changes when
> server-side compression is in use, you'd have to decompress the stream
> on the client side in order to figure out where in the steam you ought
> to inject your changes. But having to do that is a major expense. If
> the client instead told the server what to change when generating the
> archive, and the server did it, this expense could be avoided. It
> would have the additional advantage that the backup manifest could
> reflect the effects of those changes; right now it doesn't, and
> pg_verifybackup just knows to expect differences in those files.

Hm. I don't think I terribly like the idea of things like -R having to
be processed server side. That'll be awfully annoying to keep working
across versions, for one. But perhaps the config file should just not be
in the main tar file going forward?

I think we should eventually be able to use one archive for multiple
purposes, e.g. to set up a standby as well as using it for a base
backup. Or multiple standbys with different tablespace remappings.

> 2. According to the comments, some tar programs require two tar blocks
> (i.e. 512-byte blocks) of zero bytes at the end of an archive. The
> server does not generate these blocks of zero bytes, so it basically
> creates a tar file that works fine with my copy of tar but might break
> with somebody else's. Instead, the client appends 1024 zero bytes to
> the end of every file it receives from the server. That is an odd way
> of fixing this problem, and it makes things rather inflexible. If the
> server sends you any kind of a file OTHER THAN a tar file with the
> last 1024 zero bytes stripped off, then adding 1024 zero bytes will be
> the wrong thing to do. It would be better if the server just generated
> fully correct tar files (whatever we think that means) and the client
> wrote out exactly what it got from the server. Then, we could have the
> server generate cpio archives or zip files or gzip-compressed tar
> files or lz4-compressed tar files or anything we like, and the client
> wouldn't really need to care as long as it didn't need to extract
> those archives. That seems a lot cleaner.

Yea.

> 5. As things stand today, the client must know exactly how many
> archives it should expect to receive from the server and what each one
> is. It can do that, because it knows to expect one archive per
> tablespace, and the archive must be an uncompressed tarfile, so there
> is no ambiguity. But, if the server could send archives to other
> places, or send other kinds of archives to the client, then this would
> become more complex. There is no intrinsic reason why the logic on the
> client side can't simply be made more complicated in order to cope,
> but it doesn't seem like great design, because then every time you
> enhance the server, you've also got to enhance the client, and that
> limits cross-version compatibility, and also seems more fragile. I
> would rather that the server advertise the number of archives and the
> names of each archive to the client explicitly, allowing the client to
> be dumb unless it needs to post-process (e.g. extract) those archives.

ISTM that that can help to some degree, but things like tablespace
remapping etc IMO aren't best done server side, so I think the client
will continue to need to know about the contents to a significnat
degree?

> Putting all of the above together, what I propose - but have not yet
> tried to implement - is a new COPY sub-protocol for taking base
> backups. Instead of sending a COPY stream per archive, the server
> would send a single COPY stream where the first byte of each message
> is a type indicator, like we do with the replication sub-protocol
> today. For example, if the first byte is 'a' that could indicate that
> we're beginning a new archive and the rest of the message would
> indicate the archive name and perhaps some flags or options. If the
> first byte is 'p' that could indicate that we're sending archive
> payload, perhaps with the first four bytes of the message being
> progress, i.e. the number of newly-processed bytes on the server side
> prior to any compression, and the remaining bytes being payload. On
> receipt of such a message, the client would increment the progress
> indicator by the value indicated in those first four bytes, and then
> process the remaining bytes by writing them to a file or whatever
> behavior the user selected via -Fp, -Ft, -Z, etc.

Wonder if there's a way to get this to be less stateful. It seems a bit
ugly that the client would know what the last 'a' was for a 'p'? Perhaps
we could actually make 'a' include an identifier for each archive, and
then 'p' would append to a specific archive? Which would then also would
allow for concurrent processing of those archives on the server side.

I'd personally rather have a separate message type for progress and
payload. Seems odd to have to send payload messages with 0 payload just
because we want to update progress (in case of uploading to
e.g. S3). And I think it'd be nice if we could have a more extensible
progress measurement approach than a fixed length prefix. E.g. it might
be nice to allow it to report both the overall progress, as well as a
per archive progress. Or we might want to send progress when uploading
to S3, even when not having pre-calculated the total size of the data
directory.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-07-31 17:00:08 Re: Why is pq_begintypsend so slow?
Previous Message Robert Haas 2020-07-31 16:42:51 Re: new heapcheck contrib module