Re: design for parallel backup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: design for parallel backup
Date: 2020-04-21 22:57:06
Message-ID: 20200421225706.vxntxxcoukgthhit@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-04-21 17:09:50 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 4:14 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > It was local TCP. The speeds I can reach are faster than the 10GBit/s
> > (unidirectional) I can do between the laptop & workstation, so testing
> > it over "actual" network isn't informative - I basically can reach line
> > speed between them with any method.
>
> Is that really a conclusive test, though? In the case of either local
> TCP or a fast local interconnect, you'll have negligible latency. It
> seems at least possible that saturating the available bandwidth is
> harder on a higher-latency connection. Cross-region data center
> connections figure to have way higher latency than a local wired
> network, let alone the loopback interface.

Sure. But that's what the TCP window etc. should take care of. You might
have to tune the OS if you have a high-latency multi-gigabit link, but
you'd have to do that regardless of whether a single process or multiple
processes are used. And the number of people with high-latency
multi-gigabit links isn't that high, compared to the number taking
backups within a datacenter.
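
To be concrete, the tuning I mean is mostly raising the kernel's socket
buffer limits so the TCP window can actually cover the bandwidth-delay
product. A minimal sketch for Linux - the values are purely illustrative,
not recommendations:

  # allow sockets to grow their buffers up to 64MB
  sysctl -w net.core.rmem_max=67108864
  sysctl -w net.core.wmem_max=67108864
  # min/default/max for TCP receive and send buffers
  sysctl -w net.ipv4.tcp_rmem="4096 131072 67108864"
  sysctl -w net.ipv4.tcp_wmem="4096 131072 67108864"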

> > It was in kernel buffer cache. But I can reach 100% utilization of
> > storage too (which is slightly slower than what I can do over unix
> > socket).
> >
> > pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
> > 2.59GiB/s
> > find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
> > 2.53GiB/s
> > find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
> > 2.42GiB/s
> >
> > I tested this with a -s 5000 DB, FWIW.
>
> But that's not a real test either, because you're not writing the data
> anywhere. It's going to be a whole lot easier to saturate the read
> side if the write side is always zero latency.

I also ran variants that stored the data elsewhere, in separate threads.
But that bottleneck is lower (my storage is faster on reads than on
writes, at least once the RAM cache on the NVMe is exhausted)...
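
A crude way to exercise the write side as well is to point the same
pipeline at real storage instead of /dev/null; something along these
lines (the target path is made up):

  pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone | pv -B16M -r -a > /srv/otherssd/base.tar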

> > > It seems to me that the interesting cases may involve having lots of
> > > available CPUs and lots of disk spindles, but a comparatively slow
> > > pipe between the machines.
> >
> > Hm, I'm not sure I am following. If network is the bottleneck, we'd
> > immediately fill the buffers, and that'd be that?
> >
> > ISTM all of this is only really relevant if either pg_basebackup or
> > walsender is the bottleneck?
>
> I agree that if neither pg_basebackup nor walsender is the bottleneck,
> parallelism is unlikely to be very effective. I have realized as a
> result of your comments that I actually don't care intrinsically about
> parallel backup; what I actually care about is making backups very,
> very fast. I suspect that parallelism is a useful means to that end,
> but I interpret your comments as questioning that, and specifically
> drawing attention to the question of where the bottlenecks might be.
> So I'm trying to think about that.

I agree that trying to make backups very fast is a good goal (or well,
given the current situation, "not very slow" might be the more honest
way to put it). I am just trying to make sure we tackle the right
problems for that. My gut feeling is that we have to tackle compression
first, because without addressing that "all hope is lost" ;)

FWIW, here's the base backup of a pgbench -i -s 5000 database compressed
a number of ways. The uncompressed backup is 64622701911 bytes.
Unfortunately pgbench -i -s 5000 is not a particularly good example, as
it's just too compressible.

method   level  parallelism  wall-time  user-cpu  kernel-cpu         size  ratio  format
gzip         1            1     380.79    368.46       12.15   3892457816   16.6  .gz
gzip         6            1     976.05    963.10       12.84   3594605389   18.0  .gz
pigz         1           10      34.35    364.14       23.55   3892401867   16.6  .gz
pigz         6           10     101.27   1056.85       28.98   3620724251   17.8  .gz
zstd-gz      1            1     278.14    265.31       12.81   3897174342   15.6  .gz
zstd-gz      1            6     906.67    893.58       12.52   3598238594   18.0  .gz
zstd         1            1      82.95     67.97       11.82   2853193736   22.6  .zstd
zstd         1            6     228.58    214.65       13.92   2687177334   24.0  .zstd
zstd         1           10      25.05    151.84       13.35   2847414913   22.7  .zstd
zstd         6           10      43.47    374.30       12.37   2745211100   23.5  .zstd
zstd         6           20      32.50    468.18       13.44   2745211100   23.5  .zstd
zstd         9           20      57.99    949.91       14.13   2606535138   24.8  .zstd
lz4          1            1      49.94     36.60       13.33   7318668265    8.8  .lz4
lz4          3            1     201.79    187.36       14.42   6561686116   9.84  .lz4
lz4          6            1     318.35    304.64       13.55   6560274369    9.9  .lz4
pixz         1           10      92.54    925.52       37.00   1199499772   53.8  .xz
pixz         3           10     210.77   2090.38       37.96   1186219752   54.5  .xz
bzip2        1            1    2210.04   2190.89       17.67   1276905211   50.6  .bz2
pbzip2       1           10     236.03   2352.09       34.01   1332010572   48.5  .bz2
plzip        1           10     243.08   2430.18       25.60    915598323   70.6  .lz
plzip        3           10     359.04   3577.94       27.92   1018585193   63.4  .lz
plzip        3           20     197.36   3911.85       22.02   1018585193   63.4  .lz

(Times are in seconds, sizes in bytes, and "ratio" is uncompressed size
divided by compressed size. zstd-gz is zstd with --format=gzip; the zstd
runs with parallelism 1 used --single-thread, to avoid the separate IO
thread zstd otherwise uses by default, even with -T0.)
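
Rows like the above can be produced with invocations roughly of this
shape - not my exact command lines, just to illustrate the setup
(base.tar stands for the uncompressed -s 5000 base backup; --format=gzip
requires a zstd built with zlib support):

  pgbench -i -s 5000
  pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone > base.tar

  /usr/bin/time -v zstd -1 --single-thread -c < base.tar | wc -c   # zstd, level 1, parallelism 1
  /usr/bin/time -v zstd -1 -T10 -c < base.tar | wc -c              # zstd, level 1, parallelism 10
  /usr/bin/time -v zstd -1 --format=gzip --single-thread -c < base.tar | wc -c   # zstd-gz
  /usr/bin/time -v pigz -1 -p 10 -c < base.tar | wc -c             # pigz, level 1, 10 threads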

These numbers weren't taken on a completely quiesced system, and I ran
the gzip and bzip2 tests in parallel because they took so long. But I
think this still gives a good overview (and user CPU time isn't much
affected by small amounts of noise anyway).

It looks to me like bzip2/pbzip2 are clearly too slow. pixz looks
interesting, as it achieves pretty good compression ratios at a lower
cost than plzip. plzip's ratios are impressive, but damn, is it
expensive. And a higher compression level producing a *larger* file is
also a bit "huh"?

Does anybody have a better idea of what exactly to use as a good test
corpus? pgbench -i clearly sucks, but ...

One thing this reminded me of is whether using a format (tar) that
doesn't allow efficient addressing of individual files is a good idea
for base backups. The compression ratios will very likely be better when
tiny files aren't compressed individually, but at the same time it'd be
very useful to be able to access individual files more efficiently than
in O(N). I can imagine that being important for some cases of
incremental backup assembly.
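
To illustrate the access pattern (file names hypothetical): pulling a
single relation file out of a tar stream means scanning the archive
until the member shows up, whereas a plain directory layout can address
the file directly:

  # tar: O(archive size) - reads through the archive to find the member
  time tar -xOf base.tar base/16384/16385 > /dev/null
  # plain format: direct lookup by path
  time cat /srv/backup/data/base/16384/16385 > /dev/null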

> > I think it's fairly obvious that we need faster compression - and that
> > while we clearly can win a lot by just using a faster
> > algorithm/implementation than standard zlib, we'll likely also need
> > parallelism in some form. I'm doubtful that using multiple connections
> > and multiple backends is the best way to achieve that, but it'd be a
> > way.
>
> I think it has a good chance of being pretty effective, but it's
> certainly worth casting about for other possibilities that might
> deliver more benefit or be less work. In terms of better compression,
> I did a little looking around and it seems like LZ4 is generally
> agreed to be a lot faster than gzip, and also significantly faster
> than most other things that one might choose to use. On the other
> hand, the compression ratio may not be as good; e.g.
> https://facebook.github.io/zstd/ cites a 2.1 ratio (on some data set)
> for lz4 and a 2.9 ratio for zstd. While the compression and
> decompression speeds are slower, they are close enough that you might
> be able to make up the difference by using 2x the cores for
> compression and 3x for decompression. I don't know if that sort of
> thing is worth considering. If your limitation is the line speed, and
> you have CPU cores to burn, a significantly higher compression
> ratio means significantly faster backups. On the other hand, if you're
> backing up over the LAN and the machine is heavily taxed, that's
> probably not an appealing trade.

I think zstd with a low compression "setting" would be a pretty good
default for most cases. lz4 is considerably faster, true, but its
compression ratios are also considerably worse. I think lz4 is great for
mostly in-memory workloads (e.g. a compressed cache, or a live database
with compressed data, as it allows reasonably close-to-memory speeds
while holding twice the data), but for anything longer-lived zstd is
probably better.

The other big benefit is that zstd's library has multi-threaded
compression built in, which isn't the case for any other compression
library I am aware of.
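
As far as I know the zstd CLI's -T option is just that library feature
(the nbWorkers compression parameter) exposed, i.e. a single compression
context fans work out to multiple threads, whereas gzip needs an
external wrapper like pigz to get parallelism:

  zstd -6 -T10 -c < base.tar > /dev/null    # multi-threaded inside libzstd
  pigz -6 -p 10 -c < base.tar > /dev/null   # parallelism bolted on outside zlib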

Greetings,

Andres Freund
