multithreaded zstd backup compression for client and server

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: multithreaded zstd backup compression for client and server
Date: 2022-03-23 20:34:04
Message-ID: CA+Tgmobj6u-nWF-j=FemygUhobhryLxf9h-wJN7W-2rSsseHNA@mail.gmail.com
Lists: pgsql-hackers

[ Changing subject line in the hopes of attracting more eyeballs. ]

On Mon, Mar 14, 2022 at 12:11 PM Dipesh Pandit <dipesh(dot)pandit(at)gmail(dot)com> wrote:
> I tried to implement support for parallel ZSTD compression.

Here's a new patch for this. It's more of a rewrite than an update,
honestly; commit ffd53659c46a54a6978bcb8c4424c1e157a2c0f1 necessitated
totally different options handling, but I also redid the test cases,
the documentation, and the error message.
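
For concreteness: with that commit in place, the number of compression
workers is just another keyword in the compression detail string, so
the new capability is requested with something like this (illustrative
invocations with a made-up backup directory, not text copied from the
patch):

    pg_basebackup -D /path/to/backup --compress=server-zstd:workers=4
    pg_basebackup -D /path/to/backup --compress=client-zstd:level=5,workers=4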

For those who may not have been following along, here's an executive
summary: libzstd offers an option for parallel compression. It's
intended to be transparent: you just say you want it, and the library
takes care of it for you. Since we have the ability to do backup
compression on either the client or the server side, we can expose
this option in both locations. That would be cool, because it would
allow for really fast backup compression with a good compression
ratio. It would also mean that we would be, or really libzstd would
be, spawning threads inside the PostgreSQL backend. Short of cats and
dogs living together, it's hard to think of anything more terrifying,
because the PostgreSQL backend is very much not thread-safe. However,
a lot of the things we usually worry about when people make noises
about using threads in the backend don't apply here, because the
threads are hidden away behind libzstd interfaces and can't execute
any PostgreSQL code. Therefore, I think it might be safe to just ...
turn this on. One reason I think that is that this whole approach was
recommended to me by Andres ... but that's not to say that there
couldn't be problems. I worry a bit that the mere presence of threads
could in some way mess things up, but I don't know what the mechanism
for that would be, and I don't want to postpone shipping useful
features based on nebulous fears.
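
To be concrete about what "just say you want it" means, here's a
minimal sketch of multithreaded compression using libzstd's advanced
API. This is generic example code rather than an excerpt from the
patch, and the buffer handling is simplified (a real caller would loop
until the end-of-frame flush reports completion):

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>

    int main(void)
    {
        ZSTD_CCtx      *cctx = ZSTD_createCCtx();
        const char      src[] = "some data worth compressing";
        char            dst[256];
        ZSTD_inBuffer   in = {src, strlen(src), 0};
        ZSTD_outBuffer  out = {dst, sizeof(dst), 0};
        size_t          ret;

        /* Ask for 4 worker threads. A libzstd built without
         * multithreading support fails this call, so check it. */
        ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 4);
        if (ZSTD_isError(ret))
        {
            fprintf(stderr, "nbWorkers: %s\n", ZSTD_getErrorName(ret));
            return 1;
        }

        /* The streaming loop itself is unchanged: libzstd farms the
         * work out to its own threads behind this call, and none of
         * those threads ever executes any of the caller's code. */
        ret = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
        if (ZSTD_isError(ret))
            return 1;

        printf("%zu bytes in, %zu bytes out\n", in.pos, out.pos);
        ZSTD_freeCCtx(cctx);
        return 0;
    }

The only difference from ordinary single-threaded streaming
compression is the ZSTD_c_nbWorkers call; everything the library does
with its threads stays hidden behind ZSTD_compressStream2(). (Compile
with something like "cc example.c -lzstd".)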

In my ideal world, I'd like to push this into v15. I've done a lot of
work to improve the backup code in this release, and this is actually
a very small change, yet one that potentially enables the project to
get a lot more value out of the work that has already been committed.
That said, I also don't want to break the world, so if you have an
idea what this would break, please tell me.

For those curious as to how this affects performance and backup size,
I loaded the UK land registry data, which produces a 3769MB database.
Then I backed it up with both client-side and server-side compression,
using each of the algorithms supported in the master branch, plus
parallel zstd.

no compression: 3.7GB, 9 seconds
gzip: 1.5GB, 140 seconds with server-side, 141 seconds with client-side
lz4: 2.0GB, 13 seconds with server-side, 12 seconds with client-side

zstd, client-side: 1.7GB, 17 seconds
zstd, server-side: 1.3GB, 25 seconds
parallel zstd, 4 workers, client-side: 1.7GB, 7.5 seconds
parallel zstd, 4 workers, server-side: 1.3GB, 7.2 seconds

For both parallel and non-parallel zstd compression, I see differences
in the compressed size depending on where the compression is done. I
don't know whether this is expected behavior of the zstd library or a
bug. Both files uncompress OK and pass pg_verifybackup, but that
doesn't mean we're not, for example, selecting different compression
levels where we shouldn't be. I'll try to figure out what's going on
here.

Notice that compressing the backup with parallel zstd is actually
faster than taking an uncompressed backup, even though this test is
all being run on the same machine. That's kind of crazy to me: the
parallel compression is so fast that we save more time on I/O than we
spend compressing. This assumes of course that you have plenty of CPU
resources and limited I/O resources, which won't be true for everyone,
but it's not an unusual situation.

I think the documentation changes in this patch might not be quite up
to scratch. I think there's a brewing problem here: as we add more
compression options, whether or not that happens in this release, and
regardless of what specific options we add, the way things are
structured right now, we're going to end up either duplicating a bunch
of stuff between the pg_basebackup documentation and the BASE_BACKUP
documentation, or else one of those places is going to end up lacking
information that someone reading it might like to have. I'm not
exactly sure what to do about this, though.

This patch contains a trivial adjustment to
PostgreSQL::Test::Cluster::run_log to make it return a useful value
instead of nothing. I think that should be pulled out and committed
independently regardless of what happens to this patch overall, and
possibly back-patched.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment: 0001-Allow-parallel-zstd-compression-when-taking-a-base-b.patch (application/octet-stream, 12.9 KB)
