Re: pg_combinebackup --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-03-31 05:42:55
Message-ID: c82137eb-460e-41ca-be78-1bb32829bf1f@enterprisedb.com

On 3/31/24 06:46, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> I'm on ZFS 2.2.2 (on Linux). But there's something wrong, because the
>> pg_combinebackup run that took ~150s on xfs/btrfs takes ~900s on ZFS.
>>
>> I'm not sure it's a ZFS config issue, though, because it's not CPU or
>> I/O bound, and I see this on both machines. And some simple dd tests
>> show the zpool can do 10x the throughput. Could this be due to the file
>> header / pool alignment?
>
> Could ZFS recordsize > 8kB be making it worse, repeatedly dealing with
> the same 128kB record as you copy_file_range 16 x 8kB blocks?
> (Guessing you might be using the default recordsize?)
>

No, I reduced the recordsize to 8kB. And the pgbench init takes about
the same time as on the other filesystems on this hardware, I think -
roughly 10 minutes for scale 5000.
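
For the record, I set it on the dataset before running the init, so the
data files should all have been written with 8kB records (the dataset
name here is just from my test setup):

    zfs set recordsize=8k tank/pgdata
    zfs get recordsize tank/pgdata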

>> I admit I'm not very familiar with the format, but you're probably
>> right that there's a header, and header_length does not seem to
>> consider alignment. make_incremental_rfile simply does this:
>>
>>     /* Remember length of header. */
>>     rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
>>         sizeof(rf->truncation_block_length) +
>>         sizeof(BlockNumber) * rf->num_blocks;
>>
>> and sendFile() does the same thing when creating an incremental
>> basebackup. I guess it wouldn't be too difficult to make sure this is
>> aligned to BLCKSZ or something like that. I wonder if the file format
>> is documented somewhere ... It'd certainly be nicer to tweak it before
>> v18, if necessary.
>>
>> Anyway, is that really a problem? I mean, in my tests the CoW stuff
>> seemed to work quite fine - at least on XFS/BTRFS. Although, maybe
>> that's why it took longer on XFS ...
>
> Yeah, I'm not sure - I assume it did more allocating and copying
> because of that. It wouldn't matter much if a first version weren't as
> good as possible, and it'd be fine to tune the format later once we
> know more, i.e. leaving improvements on the table. I just wanted to
> share the observation. I wouldn't be surprised if the block-at-a-time
> coding makes it slower and maybe makes the on-disk data structures
> worse, but I dunno, I'm just guessing.
>
> It's also interesting but not required to figure out how to tune ZFS
> well for this purpose right now...

No idea. Any suggestions for good ZFS statistics to check?
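
FWIW, regarding the header alignment discussed above - just to
illustrate what I had in mind, an (untested) sketch on the reading side
might be to round up right after the existing calculation, e.g. using
the TYPEALIGN macro from c.h:

    /* Remember length of header, padded to a BLCKSZ boundary. */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;
    rf->header_length = TYPEALIGN(BLCKSZ, rf->header_length);

Of course, sendFile() would have to actually write the padding bytes
after the block map too, and the readers would need to skip them, so
it's not quite this simple - but the arithmetic itself seems easy.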

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
