From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: design for parallel backup |
Date: | 2020-04-21 18:01:28 |
Message-ID: | CA+TgmobZGVGXVrHSmCwSxaewQNShLa5gQXxgh0oSzoOaPGmwmw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> It's all CRC overhead. I don't see a difference with
> --manifest-checksums=none anymore. We really should look for a better
> "fast" checksum.
Hmm, OK. I'm wondering exactly what you tested here. Was this over
your 20GiB/s connection between laptop and workstation, or was this
local TCP? Also, was the database being read from persistent storage,
or was it RAM-cached? How do you expect to take advantage of I/O
parallelism without multiple processes/connections?
Meanwhile, I did some local-only testing on my new 16GB MacBook Pro
laptop with all combinations of:
- UNIX socket, local TCP socket, local TCP socket with SSL
- Plain format, tar format, tar format with gzip
- No manifest ("omit"), manifest with no checksums, manifest with
CRC-32C checksums, manifest with SHA256 checksums.
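For readers who want to reproduce the matrix, here is a minimal sketch of how the 3 x 3 x 4 = 36 combinations could be enumerated. It is not the attached test_backup_speed.pl. The connection strings, the target directory layout, and the label scheme (modeled on the "tcp+tar+m:omit" label used later in this message) are assumptions; the pg_basebackup options themselves (--format, --gzip, --no-manifest, --manifest-checksums, --dbname) are the documented ones. The sketch only builds argument lists and does not run anything.

```python
import itertools

# Assumed connection settings; an empty list means "use the default
# local UNIX socket". The real test script may have done this differently.
CONNECTIONS = {
    "unix": [],
    "tcp": ["--dbname=host=localhost sslmode=disable"],
    "ssl": ["--dbname=host=localhost sslmode=require"],
}

FORMATS = {
    "plain": ["--format=plain"],
    "tar": ["--format=tar"],
    "targz": ["--format=tar", "--gzip"],
}

MANIFESTS = {
    "omit": ["--no-manifest"],
    "none": ["--manifest-checksums=NONE"],
    "crc32c": ["--manifest-checksums=CRC32C"],
    "sha256": ["--manifest-checksums=SHA256"],
}


def build_commands(target_root):
    """Return {label: argv} for every connection/format/manifest combination."""
    commands = {}
    for (conn, conn_args), (fmt, fmt_args), (man, man_args) in itertools.product(
        CONNECTIONS.items(), FORMATS.items(), MANIFESTS.items()
    ):
        # Hypothetical label scheme, e.g. "tcp+tar+m:omit".
        label = f"{conn}+{fmt}+m:{man}"
        commands[label] = [
            "pg_basebackup",
            "-D", f"{target_root}/{label}",
            *conn_args, *fmt_args, *man_args,
        ]
    return commands


if __name__ == "__main__":
    for label, argv in build_commands("/tmp/backup_test").items():
        print(label, " ".join(argv))
```

Each argument list could then be timed with subprocess.run and a wall-clock timer, removing the target directory between runs.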
The database is a fresh scale-factor 1000 pgbench database. No
concurrent database load. Observations:
- UNIX socket was slower than a local TCP socket, and about the same
speed as a TCP socket with SSL.
- CRC-32C is about 10% slower than no manifest and/or no checksums in
the manifest. SHA256 is 1.5-2x slower, but less when compression is
also used (see below).
- Plain format is a little slower than tar format; tar with gzip is
typically 5x or more slower, but less when the checksum algorithm is
SHA256 (again, see below).
- SHA256 + tar format with gzip is the slowest combination, but it's
"only" about 15% slower than no manifest, and about 3.3x slower than
no compression, presumably because the checksumming is slowing down
the server and the compression is slowing down the client.
- Fastest speeds I see in any test are ~650MB/s, and slowest are
~65MB/s, obviously benefiting greatly from the fact that this is a
local-only test.
- The time for a raw cp -R of the backup directory is about 10s, and
the fastest time to take a backup (tcp+tar+m:omit) is about 22s.
- In all cases I've checked so far both pg_basebackup and the server
backend are pegged at 98-100% CPU usage. I haven't looked into where
that time is going yet.
Full results and test script attached. I and/or my colleagues will try
to test out some other environments, but I'm not sure we have easy
access to anything as high-powered as a 20GiB/s interconnect.
It seems to me that the interesting cases may involve having lots of
available CPUs and lots of disk spindles, but a comparatively slow
pipe between the machines. I mean, if it takes 36 hours to read the
data from disk, you can't realistically expect to complete a full
backup in less than 36 hours. Incremental backup might help, but
otherwise you're just dead. On the other hand, if you can read the
data from the disk in 2 hours but it takes 36 hours to complete a
backup, it seems like you have more justification for thinking that
the backup software could perhaps do better. In such cases efficient
server-side compression may help a lot, but even then, I wonder
whether you can read the data at maximum speed with only a single
process. I tend to doubt it, but I guess you only have to be fast
enough to saturate the network. Hmm.
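The bottleneck argument above can be put as back-of-the-envelope arithmetic. This is only a sketch: it assumes the disk read and the network transfer overlap perfectly, so the slower stage sets the floor, and it ignores CPU cost entirely. The function name, the data size, the throughput figures, and the compression ratio are all hypothetical, chosen so that the uncompressed case matches the 2-hour read and 36-hour backup mentioned above.

```python
def backup_time_lower_bound_hours(data_gb, disk_gb_per_hour,
                                  network_gb_per_hour, compression_ratio=1.0):
    """Lower bound on backup time when reading and sending are pipelined.

    compression_ratio is the assumed factor by which server-side
    compression shrinks the data before it crosses the network.
    """
    read_hours = data_gb / disk_gb_per_hour
    send_hours = (data_gb / compression_ratio) / network_gb_per_hour
    # The slowest stage of the pipeline dominates.
    return max(read_hours, send_hours)


if __name__ == "__main__":
    # Hypothetical numbers: 7200 GB of data, disks that can deliver it
    # in 2 hours, and a pipe that needs 36 hours for the raw bytes.
    print(backup_time_lower_bound_hours(7200, 3600, 200))
    print(backup_time_lower_bound_hours(7200, 3600, 200, compression_ratio=3))
```

With these made-up numbers the network is the limit in both cases, so compression helps in proportion to its ratio until the disk read time becomes the floor. Whether a single process can sustain that read rate is exactly the open question.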
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment | Content-Type | Size |
---|---|---|
results.txt | text/plain | 1.8 KB |
test_backup_speed.pl | text/x-perl-script | 1.8 KB |