| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: design for parallel backup | 
| Date: | 2020-04-21 20:14:35 | 
| Message-ID: | 20200421201435.ptsfxwdhfjhd4t2s@alap3.anarazel.de | 
| Lists: | pgsql-hackers | 
Hi,
On 2020-04-21 14:01:28 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > It's all CRC overhead. I don't see a difference with
> > --manifest-checksums=none anymore. We really should look for a better
> > "fast" checksum.
>
> Hmm, OK. I'm wondering exactly what you tested here. Was this over
> your 20GiB/s connection between laptop and workstation, or was this
> local TCP?
It was local TCP. The speeds I can reach are faster than the 10GiB/s
(unidirectional) I can do between the laptop & workstation, so testing
it over "actual" network isn't informative - I basically can reach line
speed between them with any method.
> Also, was the database being read from persistent storage, or was it
> RAM-cached?
It was in kernel buffer cache. But I can reach 100% utilization of
storage too (which is slightly slower than what I can do over unix
socket).
pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
2.59GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
2.53GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
2.42GiB/s
I tested this with a -s 5000 DB, FWIW.
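For comparison outside the shell, the same kind of sequential buffered-read measurement can be sketched in Python. This is only an illustration: the helper name `read_throughput` and the temporary file standing in for a relation file are assumptions, and the 32k block size simply mirrors the `dd bs=32k` invocation above.

```python
import os
import tempfile
import time


def read_throughput(path, block_size=32 * 1024):
    """Read `path` sequentially and return (bytes_read, seconds_elapsed)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 bypasses Python's own buffering, so each read() asks the
    # kernel for up to block_size bytes, similar to dd with bs=32k.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical stand-in for a data file; a real test would point at
    # files under the cluster's base/ directory instead.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        nbytes, secs = read_throughput(tmp.name)
        print(f"{nbytes / max(secs, 1e-9) / 2**30:.2f} GiB/s")
    finally:
        os.unlink(tmp.name)
```

A freshly written file will be served from the kernel buffer cache, so the number reflects cached reads rather than storage speed.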
> How do you expect to take advantage of I/O parallelism without
> multiple processes/connections?
Which kind of I/O parallelism are you thinking of? Independent
tablespaces? Or devices that can handle multiple in-flight IOs? WRT the
latter, at least linux will keep many IOs in-flight for sequential
buffered reads.
> - UNIX socket was slower than a local TCP socket, and about the same
> speed as a TCP socket with SSL.
Hm. Interesting. Wonder if that's a question of the unix socket buffer
size?
> - CRC-32C is about 10% slower than no manifest and/or no checksums in
> the manifest. SHA256 is 1.5-2x slower, but less when compression is
> also used (see below).
> - Plain format is a little slower than tar format; tar with gzip is
> typically >~5x slower, but less when the checksum algorithm is SHA256
> (again, see below).
I see about 250MB/s with -Z1 (from the source side). If I hack
pg_basebackup.c to specify a deflate level of 0 to gzsetparams, which
the zlib docs say should disable compression, I get up to 700MB/s. Which
is still a factor of ~3.7 slower than uncompressed.
This seems largely due to zlib's crc32 computation not being hardware
accelerated:
-   99.75%     0.05%  pg_basebackup  pg_basebackup       [.] BaseBackup
   - 99.95% BaseBackup
      - 81.60% writeTarData
         - gzwrite
         - gz_write
            - gz_comp.constprop.0
               - 85.11% deflate
                  - 97.66% deflate_stored
                     + 87.45% crc32_z
                     + 9.53% __memmove_avx_unaligned_erms
                     + 3.02% _tr_stored_block
                    2.27% __memmove_avx_unaligned_erms
               + 14.86% __libc_write
      + 18.40% pqGetCopyData3
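The reason a checksum shows up even at level 0 is that the gzip container always carries a CRC-32 of the uncompressed data in its trailer, so zlib has to compute it even when it only emits stored blocks. A small sketch using Python's `zlib` bindings (not pg_basebackup's code path; the 1 MiB random buffer is an arbitrary assumption) shows this:

```python
import os
import struct
import zlib

data = os.urandom(1 << 20)  # arbitrary 1 MiB test buffer

# wbits=31 selects the gzip container; level 0 emits stored (uncompressed)
# deflate blocks, which is what deflate_stored in the profile produces.
comp = zlib.compressobj(0, zlib.DEFLATED, 31)
stream = comp.compress(data) + comp.flush()

# The gzip trailer is the CRC-32 of the uncompressed data followed by its
# length modulo 2**32, both little-endian.
crc, size = struct.unpack("<II", stream[-8:])

print("framing overhead in bytes:", len(stream) - len(data))
print("trailer CRC matches zlib.crc32:", crc == zlib.crc32(data))
print("trailer size matches input:", size == len(data))
```

So "no compression" in gzip format still means one full CRC-32 pass over the data, plus the memory copies visible in the profile.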
> It seems to me that the interesting cases may involve having lots of
> available CPUs and lots of disk spindles, but a comparatively slow
> pipe between the machines.
Hm, I'm not sure I am following. If network is the bottleneck, we'd
immediately fill the buffers, and that'd be that?
ISTM all of this is only really relevant if either pg_basebackup or
walsender is the bottleneck?
> I mean, if it takes 36 hours to read the
> data from disk, you can't realistically expect to complete a full
> backup in less than 36 hours. Incremental backup might help, but
> otherwise you're just dead. On the other hand, if you can read the
> data from the disk in 2 hours but it takes 36 hours to complete a
> backup, it seems like you have more justification for thinking that
> the backup software could perhaps do better. In such cases efficient
> server-side compression may help a lot, but even then, I wonder
> whether you can read the data at maximum speed with only a single
> process? I tend to doubt it, but I guess you only have to be fast
> enough to saturate the network. Hmm.
Well, I can do >8GByte/s of buffered reads in a single process
(obviously cached, because I don't have storage quite that fast -
uncached I can read at nearly 3GByte/s, the disk's speed). So sure,
there's a limit to what a single process can do, but I think we're
fairly far away from it.
I think it's fairly obvious that we need faster compression - and that
while we clearly can win a lot by just using a faster
algorithm/implementation than standard zlib, we'll likely also need
parallelism in some form.  I'm doubtful that using multiple connections
and multiple backends is the best way to achieve that, but it'd be a
way.
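One form of parallelism that does not need multiple connections is compressing independent chunks concurrently inside a single process. The sketch below is an illustration only, not a proposal from this thread: the function name `parallel_gzip`, the chunk size, and the worker count are assumptions. It relies on the fact that concatenated gzip members form a valid gzip stream, and on CPython's zlib bindings releasing the interpreter lock while compressing, so threads can use several cores.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data, chunk_size=1 << 20, workers=4, level=1):
    """Compress `data` as a sequence of independent gzip members."""
    # Split the input into fixed-size chunks; an empty input still yields
    # one (empty) member so the output is a valid gzip stream.
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the members are
        # concatenated in the same order as the chunks.
        members = pool.map(
            lambda c: gzip.compress(c, compresslevel=level), chunks)
        return b"".join(members)


if __name__ == "__main__":
    payload = bytes(range(256)) * 20000
    packed = parallel_gzip(payload)
    # Standard gzip readers decode multi-member streams transparently.
    assert gzip.decompress(packed) == payload
    print(len(payload), "->", len(packed))
```

The trade-off is a somewhat worse compression ratio, since each chunk starts with an empty dictionary, in exchange for output that any ordinary gzip reader can still decode.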
Greetings,
Andres Freund