From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-04-01 21:45:17
Message-ID: CA+hUKGLvt4un0KecVJ7a8wCx=SncX5d7FZ055-n6QXYoon4a2Q@mail.gmail.com
Lists: pgsql-hackers
On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to an 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>     pct    not aligned    aligned
>     ---------------------------------
>      1%           689M        19M
>     10%          3172M        22M
>     20%         13797M        27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).
Fantastic.
> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add an alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so it may be better to be ready and not assume all filesystems
> use 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.
+1, and I think BLCKSZ is the right choice.
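For anyone skimming the thread, the alignment fix amounts to padding
whatever precedes the block data out to the next BLCKSZ boundary, so
every stored block lands on an aligned file offset. A minimal sketch of
the arithmetic (function name is made up here, not taken from the 0001
patch):

    #include <stddef.h>

    #ifndef BLCKSZ
    #define BLCKSZ 8192             /* PostgreSQL's default block size */
    #endif

    /*
     * Round a header length up to the next BLCKSZ boundary so that the
     * block payload that follows starts at an aligned file offset.
     * BLCKSZ is a power of two, so the usual mask trick works.
     */
    static size_t
    pad_to_blcksz(size_t len)
    {
        return (len + BLCKSZ - 1) & ~((size_t) (BLCKSZ - 1));
    }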
> I was very puzzled by the awful performance on ZFS. While every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior
> I've observed on ZFS. I assume it's because it doesn't use the page
> cache where the regular readahead is detected etc. And there's no
> prefetching in pg_combinebackup, so I decided to do an experiment and
> added a trivial explicit prefetch when reconstructing the file - every
> time we'd read data from a file, we do posix_fadvise for up to 128
> blocks ahead (similar to what the bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about
> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
> pretty massive.
Interesting. ZFS certainly has its own prefetching heuristics with
lots of logic and settings, but it could be that it's using
strict-next-block detection of access pattern (ie what I called
effective_io_readahead_window=0 in the streaming I/O thread) instead
of a window (ie like the Linux block device level read ahead where,
AFAIK, if you access anything in that sliding window it is triggered),
and perhaps your test has a lot of non-contiguous but close-enough
blocks? (Also reminds me of the similar discussion on the BHS thread
about distinguishing sequential access from
mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
fine line between them.)
You could double-check this and the related prefetch settings (for
example, I think it might disable itself automatically if you're on a VM
with a small RAM size).
> There are a couple more interesting ZFS details - the prefetching seems
> to be necessary even when using copy_file_range() and we don't need to
> read the data (to calculate checksums). This is why the "manifest=off"
> chart has the strange group of high bars at the end - the copy cases are
> fast because prefetch happens, but if we switch to copy_file_range()
> there are no prefetches and it gets slow.
Hmm, at a guess, it might be due to prefetching the dnode (root object
for a file) and block pointers, ie the structure but not the data
itself.
> This is a bit bizarre, especially because the manifest=on cases are
> still fast, exactly because the pread + prefetching still happens. I'm
> sure users would find this puzzling.
>
> Unfortunately, the prefetching is not beneficial for all filesystems.
> For XFS it does not seem to make any difference, but on BTRFS it seems
> to cause a regression.
>
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.
Seems reasonable if you can't fix it by tuning ZFS. (Might also be an
interesting research topic for a potential ZFS patch:
prefetch_swiss_cheese_window_size. I will not be nerd-sniped into
reading the relevant source today, but I'll figure it out soonish...)
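For reference, the explicit prefetch you describe is roughly this shape:
before each read from a source file, hint the kernel about up to 128
upcoming blocks with posix_fadvise(POSIX_FADV_WILLNEED), much like the
bitmap heap scan code. A sketch only, with names I made up, not the
actual 0002 patch:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <stddef.h>

    #define PREFETCH_MAX_BLOCKS 128     /* ~1MB ahead with 8K blocks */

    /*
     * Hint the kernel to read ahead up to PREFETCH_MAX_BLOCKS blocks of
     * the source file we're about to read from.  Purely advisory; a
     * failed hint only costs us the readahead, so errors are ignored.
     */
    static void
    prefetch_source_blocks(int fd, off_t next_block, off_t total_blocks,
                           size_t block_size)
    {
        off_t   nblocks = total_blocks - next_block;

        if (nblocks > PREFETCH_MAX_BLOCKS)
            nblocks = PREFETCH_MAX_BLOCKS;
        if (nblocks <= 0)
            return;

        (void) posix_fadvise(fd,
                             next_block * (off_t) block_size,
                             nblocks * (off_t) block_size,
                             POSIX_FADV_WILLNEED);
    }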
> So I took a stab at this in 0007, which detects runs of blocks coming
> from the same source file (limited to 128 blocks, i.e. 1MB). I only did
> this for the copy_file_range() calls in 0007, and the results for XFS
> look like this (complete results are in the PDF):
>
>     pct    old (block-by-block)    new (batches)
>     ---------------------------------------------
>      1%            150s                  4s
>     10%        150-200s                 46s
>     20%        150-200s                 65s
>
> Yes, once again, those results are real, the backups are valid etc. So
> not only does it take much less space (thanks to block alignment), it
> also takes much less time (thanks to bulk operations).
Again, fantastic.
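If I'm reading the description right, the batching amounts to coalescing
consecutive blocks that come from the same source file (capped at 128
blocks, i.e. 1MB) and issuing one copy_file_range() per run. A rough
sketch of the copy side, with structure and names of my own rather than
anything from the 0007 patch:

    #define _GNU_SOURCE
    #include <unistd.h>             /* copy_file_range() */
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Copy a run of contiguous blocks from one source file in a single
     * copy_file_range() call, looping in case the kernel copies less
     * than requested.  The caller finds runs by scanning the
     * reconstruction plan for consecutive blocks coming from the same
     * source file at consecutive offsets, capped at 128 blocks.
     */
    static void
    copy_block_run(int src_fd, off_t src_off, int dst_fd, off_t dst_off,
                   size_t nbytes)
    {
        while (nbytes > 0)
        {
            ssize_t copied = copy_file_range(src_fd, &src_off,
                                             dst_fd, &dst_off,
                                             nbytes, 0);

            if (copied <= 0)
            {
                perror("copy_file_range");
                exit(EXIT_FAILURE);
            }
            nbytes -= (size_t) copied;
        }
    }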