Re: CREATE DATABASE with filesystem cloning

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: CREATE DATABASE with filesystem cloning
Date:	2023-10-09 23:48:27
Message-ID:	20231009234827.k5t2iz4bss7dwanp@awork3.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2023-10-07 18:51:45 +1300, Thomas Munro wrote:
> It should be a lot faster, and use less physical disk, than the two
> existing strategies on recent-ish XFS, BTRFS, very recent OpenZFS,
> APFS (= macOS), and it could in theory be extended to other systems
> that invented different system calls for this with more work (Solaris,
> Windows). Then extra physical disk space will be consumed only as the
> two clones diverge.

> It's just like the old strategy=file_copy, except it asks the OS to do
> its best copying trick. If you try it on a system that doesn't
> support copy-on-write, then copy_file_range() should fall back to
> plain old copy, but it might still be better than we could do, as it
> can push copy commands to network storage or physical storage.
>
> Therefore, the usual caveats from strategy=file_copy also apply here.
> Namely that it has to perform checkpoints which could be very
> expensive, and there are some quirks/brokenness about concurrent
> backups and PITR. Which makes me wonder if it's worth pursuing this
> idea. Thoughts?

I think it'd be interesting to have. For the regression tests we do end up
spending a lot of disk throughput on contents duplicated between
template0/template1/postgres. And I've spent plenty of time copying huge
template databases, to have a reproducible starting point for some benchmark
that's expensive to initialize.

If we do this, I think we should consider creating template0, template1 with
the new strategy, so that a new initdb cluster ends up with deduplicated data.

FWIW, I experimented with using cp -c on macOS for the initdb template, and
that provided some further gain. I suspect that that gain would increase if
template0/template1/postgres were deduplicated.

> diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
> index e04bc3941a..8c963ff548 100644
> --- a/src/backend/storage/file/copydir.c
> +++ b/src/backend/storage/file/copydir.c
> @@ -19,14 +19,21 @@
> #include "postgres.h"
>
> #include <fcntl.h>
> +#include <limits.h>
> #include <unistd.h>
>
> +#ifdef HAVE_COPYFILE_H
> +#include <copyfile.h>
> +#endif

We already have code for this in src/bin/pg_upgrade/file.c; it seems we ought
to move it somewhere in src/port?
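
As a rough illustration of what a shared helper could look like, here is a
minimal sketch. The name sketch_clone_file() and its return convention are
hypothetical and do not correspond to any existing PostgreSQL function. It
assumes the same kind of primitives that, to my understanding, pg_upgrade's
clone mode relies on (copyfile() with COPYFILE_CLONE_FORCE on macOS, the
FICLONE ioctl on Linux), and it falls back to a plain read/write loop when
cloning is not available. Note that this is a different technique from the
copy_file_range() call described in the quoted patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#if defined(__APPLE__)
#include <copyfile.h>
#elif defined(__linux__)
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE, if the kernel headers provide it */
#endif

/* Plain byte-for-byte copy between two open descriptors. */
static int
plain_copy(int srcfd, int dstfd)
{
	char		buf[65536];
	ssize_t		nread;

	while ((nread = read(srcfd, buf, sizeof(buf))) > 0)
	{
		char	   *p = buf;

		while (nread > 0)
		{
			ssize_t		nwritten = write(dstfd, p, (size_t) nread);

			if (nwritten < 0)
			{
				if (errno == EINTR)
					continue;	/* retry the interrupted write */
				return -1;
			}
			p += nwritten;
			nread -= nwritten;
		}
	}
	return (nread < 0) ? -1 : 0;
}

/*
 * Hypothetical helper: clone src to a new file dst if the filesystem
 * supports it, otherwise copy the bytes.  Returns 0 on success, -1 on
 * failure (including when dst already exists).
 */
int
sketch_clone_file(const char *src, const char *dst)
{
	int			srcfd;
	int			dstfd;
	int			rc = -1;
	int			save_errno;

	assert(src != NULL && dst != NULL);

#if defined(__APPLE__)
	/* APFS clone; fails without creating dst if cloning is unsupported. */
	if (copyfile(src, dst, NULL, COPYFILE_CLONE_FORCE) == 0)
		return 0;
#endif

	srcfd = open(src, O_RDONLY);
	if (srcfd < 0)
		return -1;
	dstfd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (dstfd < 0)
	{
		save_errno = errno;
		close(srcfd);
		errno = save_errno;
		return -1;
	}

#if defined(__linux__) && defined(FICLONE)
	/* Reflink on XFS/BTRFS and similar; leaves dst empty on failure. */
	if (ioctl(dstfd, FICLONE, srcfd) == 0)
		rc = 0;
#endif
	if (rc != 0)
		rc = plain_copy(srcfd, dstfd);

	save_errno = errno;
	close(srcfd);
	if (close(dstfd) != 0 && rc == 0)
	{
		rc = -1;
		save_errno = errno;
	}
	errno = save_errno;
	return rc;
}
```

A caller in copydir.c or pg_upgrade would presumably still want its own error
reporting and fsync handling around such a function; the sketch leaves both
out on purpose.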

Greetings,

Andres Freund
