Re: pg_combinebackup --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-04-02 09:25:10
Message-ID: d1df3598-487b-4397-8be5-631060ae060e@enterprisedb.com
Lists: pgsql-hackers

On 4/1/24 23:45, Thomas Munro wrote:
> ...
>>
>> I was very puzzled by the awful performance on ZFS. When every other fs
>> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
>> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
>> advice I could think of, with almost no effect.
>>
>> Ultimately I decided that it probably is the "no readahead" behavior
>> I've observed on ZFS. I assume it's because it doesn't use the page
>> cache where the regular readahead is detected etc. And there's no
>> prefetching in pg_combinebackup, so I decided to do an experiment and added
>> a trivial explicit prefetch when reconstructing the file - every time
>> we'd read data from a file, we do posix_fadvise for up to 128 blocks
>> ahead (similar to what bitmap heap scan code does). See 0002.
>>
>> And tadaaa - the duration dropped from 900-1000 seconds to only about
>> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
>> pretty massive.
>
> Interesting. ZFS certainly has its own prefetching heuristics with
> lots of logic and settings, but it could be that it's using
> strict-next-block detection of access pattern (ie what I called
> effective_io_readahead_window=0 in the streaming I/O thread) instead
> of a window (ie like the Linux block device level read ahead where,
> AFAIK, if you access anything in that sliding window it is triggered),
> and perhaps your test has a lot of non-contiguous but close-enough
> blocks? (Also reminds me of the similar discussion on the BHS thread
> about distinguishing sequential access from
> mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the
> fine line between them.)
>

I don't think the files have a lot of non-contiguous but close-enough
blocks (it's rather that we'd skip blocks that need to come from a later
incremental file). The backups are generated to have a certain fraction
of modified blocks.

For example, the smallest backup has 1% of blocks modified, which means
99% of the blocks come from the base backup and 1% come from the
increment. And indeed, the whole database is ~75GB and the incremental
backup is ~740MB, i.e. roughly 1% of 75GB. Which means that on average
there will be a run of 99 blocks read from the base backup, then 1 block
skipped (to come from the increment), and then again 99-1-99-1. So it's
very sequential with almost no holes, and the increment is read 100%
sequentially. And it still does not seem to prefetch anything.

> You could double-check this and related settings (for example I think
> it might disable itself automatically if you're on a VM with small RAM
> size):
>
> https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable
>

I haven't touched that parameter at all, so prefetching is enabled by
default (zfs_prefetch_disable=0):

# cat /sys/module/zfs/parameters/zfs_prefetch_disable
0

While trying to make the built-in prefetch work, I went through the
other parameters tagged "prefetch", without success. And I haven't seen
any advice on how to make it work ...

>> There are a couple more interesting ZFS details - the prefetching seems
>> to be necessary even when using copy_file_range() and we don't need to read
>> the data (to calculate checksums). This is why the "manifest=off" chart
>> has the strange group of high bars at the end - the copy cases are fast
>> because prefetch happens, but if we switch to copy_file_range() there
>> are no prefetches and it gets slow.
>
> Hmm, at a guess, it might be due to prefetching the dnode (root object
> for a file) and block pointers, ie the structure but not the data
> itself.
>

Yeah, that's possible. But the effect is the same - it doesn't really
matter what exactly is not prefetched. Still, perhaps we could prefetch
just a tiny part of each record, enough to pull in the dnode + block
pointers rather than the whole record. Might save some space in ARC,
perhaps?
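
A hypothetical sketch of that idea (nothing like this is in the patches;
the function name and the 16kB hint length are made up, and whether a
small WILLNEED hint actually pulls in the dnode/pointers on ZFS is
exactly the open question):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* How much of the source range to hint - an arbitrary guess. */
#define METADATA_HINT_BYTES (16 * 1024)

/* Copy one chunk, hinting only a small prefix of the source range. */
static ssize_t
copy_range_with_hint(int src_fd, int dst_fd, off_t offset, size_t length)
{
    off_t   off_in = offset;
    off_t   off_out = offset;

#ifdef POSIX_FADV_WILLNEED
    (void) posix_fadvise(src_fd, offset,
                         (off_t) (length < METADATA_HINT_BYTES ?
                                  length : METADATA_HINT_BYTES),
                         POSIX_FADV_WILLNEED);
#endif

    return copy_file_range(src_fd, &off_in, dst_fd, &off_out, length, 0);
}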

>> This is a bit bizarre, especially because the manifest=on cases are
>> still fast, exactly because the pread + prefetching still happens. I'm
>> sure users would find this puzzling.
>>
>> Unfortunately, the prefetching is not beneficial for all filesystems.
>> For XFS it does not seem to make any difference, but on BTRFS it seems
>> to cause a regression.
>>
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
>
> Seems reasonable if you can't fix it by tuning ZFS. (Might also be an
> interesting research topic for a potential ZFS patch:
> prefetch_swiss_cheese_window_size. I will not be nerd-sniped into
> reading the relevant source today, but I'll figure it out soonish...)
>

It's entirely possible I'm just too stupid and it works just fine for
everyone else. But maybe not, and I'd say prefetching that is this hard
to configure might as well not exist at all. Linux read-ahead, by
comparison, works pretty well by default.

So I don't see how to make this work without an explicit prefetch,
roughly like the sketch below ... Of course, we could also skip
prefetching entirely and tell users it's up to ZFS to make this work,
but I don't think that does them any service.
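
For illustration, the 0002 experiment boils down to something like this
(the function name and structure are simplified, not the actual patch;
only the "advise up to 128 blocks ahead before each read" part matches
what 0002 does):

#include <fcntl.h>

#define BLCKSZ            8192
#define PREFETCH_BLOCKS   128

/*
 * Before reading block "blkno" from a source file, ask the OS to read
 * ahead up to PREFETCH_BLOCKS blocks, so reconstruction does not rely
 * on the filesystem's own readahead heuristics (which ZFS seems not to
 * trigger for this access pattern).
 */
static void
prefetch_ahead(int fd, long blkno, long total_blocks)
{
#ifdef POSIX_FADV_WILLNEED
    long    nblocks = PREFETCH_BLOCKS;

    if (blkno + nblocks > total_blocks)
        nblocks = total_blocks - blkno;

    if (nblocks > 0)
        (void) posix_fadvise(fd,
                             (off_t) blkno * BLCKSZ,
                             (off_t) nblocks * BLCKSZ,
                             POSIX_FADV_WILLNEED);
#endif
}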

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
