Re: pg_upgrade parallelism

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_upgrade parallelism
Date: 2021-11-17 20:34:03
Message-ID: 20211117203403.GE17618@telsasoft.com
Lists: pgsql-hackers

On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> Hi,
>
> Currently docs about pg_upgrade says:
>
> """
> <para>
> The <option>--jobs</option> option allows multiple CPU cores to be used
> for copying/linking of files and to dump and reload database schemas
> in parallel; a good place to start is the maximum of the number of
> CPU cores and tablespaces. This option can dramatically reduce the
> time to upgrade a multi-database server running on a multiprocessor
> machine.
> </para>
> """
>
> Which makes the user think that the --jobs option can use all CPU
> cores, which is not true, or that it has something to do with multiple
> databases, which is true only to some extent.
>
> What that option really improves is upgrading servers with multiple
> tablespaces. Of course, if --link or --clone are used pg_upgrade is
> still very fast, but with the --copy option it is not what one would
> expect.

> As an example, a customer with a 25TB database, 40 cores, and lots of
> RAM used --jobs=35 and got only 7 processes (they have 6 tablespaces),
> and the disks were not used at maximum speed either. They expected 35
> processes copying lots of files at the same time.

I would test this. How long does it take to "cp -r" the data dirs vs. having
pg_upgrade copy them? If running 7 "cp" processes in parallel is faster than
the "copy" portion of pg_upgrade -j7, then pg_upgrade's file copy should be
optimized.

But if it's not faster, then maybe we should look at other options, like your
idea to copy filenodes (or their segments) in parallel.

> So, first I would like to improve documentation. What about something
> like the attached?

The relevant history is in commits
6f1b9e4efd94fc644f5de5377829d42e48c3c758
a89c46f9bc314ed549245d888da09b8c5cace104

--jobs originally parallelized pg_dump and pg_restore; copying/linking was
added later. So the docs should mention tablespaces, as you said, but should
also mention databases. It may not be an issue for you, but pg_restore is
the slowest part of our pg_upgrades, since we have many partitions.

> Now, a couple of questions:
>
> - in src/bin/pg_upgrade/file.c, copyFile() defines a buffer that
> determines the number of bytes used in each read()/write() call when
> copying the relfilenode segments, and it is defined as (50 * BLCKSZ),
> which is 400kB. Isn't this too small?

Maybe - you'll have to check :)
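
For checking, something like this lets you vary the buffer size and time the
result. This is a minimal sketch in the style of copyFile(), not the actual
file.c code, and error handling is simplified:

    /*
     * Copy src to dst using read()/write() with a caller-chosen buffer
     * size, so different values (e.g. 50 * BLCKSZ vs. several MB) can
     * be benchmarked against each other.
     */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int
    copy_with_bufsize(const char *src, const char *dst, size_t bufsize)
    {
        int         srcfd, dstfd;
        char       *buf;
        ssize_t     nread = 0;
        int         rc = 0;

        if ((srcfd = open(src, O_RDONLY)) < 0)
            return -1;
        if ((dstfd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
        {
            close(srcfd);
            return -1;
        }
        if ((buf = malloc(bufsize)) == NULL)
            rc = -1;

        while (rc == 0 && (nread = read(srcfd, buf, bufsize)) != 0)
        {
            if (nread < 0 || write(dstfd, buf, nread) != nread)
                rc = -1;
        }

        free(buf);
        close(srcfd);
        close(dstfd);
        return rc;
    }

If a much larger buffer measurably beats 50 * BLCKSZ on real storage, that
would be good evidence for changing it.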

> - why do we read()/write() at all? Is there not a faster way of
> copying the file? I'm asking because I don't actually know.

No portable way. Linux has this:
https://man7.org/linux/man-pages/man2/copy_file_range.2.html

But I just read:

| First support for cross-filesystem copies was introduced in Linux
| 5.3. Older kernels will return -EXDEV when cross-filesystem
| copies are attempted.

To me that sounds like it may not be worth it, at least not quite yet.
But it would be good to test.
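
If someone wants to experiment, the pattern would look roughly like this: a
hypothetical, Linux-only sketch, with a simplified read()/write() fallback:

    /*
     * Try copy_file_range() and fall back to plain read()/write() if
     * the kernel returns EXDEV (cross-filesystem copy on pre-5.3
     * kernels) or ENOSYS.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int
    copy_fd_contents(int srcfd, int dstfd)
    {
        struct stat st;
        off_t       remaining;

        if (fstat(srcfd, &st) < 0)
            return -1;
        remaining = st.st_size;

        while (remaining > 0)
        {
            ssize_t     n = copy_file_range(srcfd, NULL, dstfd, NULL,
                                            remaining, 0);

            if (n > 0)
            {
                remaining -= n;
                continue;
            }
            if (n == 0)
                break;          /* unexpected EOF */
            if (errno != EXDEV && errno != ENOSYS)
                return -1;

            /* Fall back to read()/write() from the current offsets. */
            {
                char        buf[65536];
                ssize_t     nread;

                while ((nread = read(srcfd, buf, sizeof(buf))) > 0)
                    if (write(dstfd, buf, nread) != nread)
                        return -1;
                return nread < 0 ? -1 : 0;
            }
        }
        return 0;
    }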

> I'm trying to add more parallelism by copying individual segments
> of a relfilenode in different processes. Does anyone see a big
> problem with trying to do that? I'm asking because no one has done it
> before, and that might not be a good sign.

My concern would be that if there are too many jobs and the storage bogs
down, it could be slower.

I think something like that should have a separate option, not just --jobs.
Like --parallel-in-tablespace. The original implementation puts processes
across CPUs (for pg_dump/restore) and tablespaces (for I/O). Maybe it should
be possible to control those with separate options, too.
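
To illustrate the segment idea: dealing the segment files of a relfilenode
out to a fixed pool of forked workers might look roughly like this. This is
a hypothetical sketch, not pg_upgrade's actual parallel machinery
(parallel.c); copy_one_file() stands in for whatever copy primitive is used,
and error handling is omitted:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Assumed copy primitive; not part of pg_upgrade. */
    extern int  copy_one_file(const char *src, const char *dst);

    static void
    copy_segments_parallel(const char **srcs, const char **dsts,
                           int nsegs, int nworkers)
    {
        for (int w = 0; w < nworkers; w++)
        {
            if (fork() == 0)
            {
                /* Child: copy every nworkers'th segment, from w. */
                for (int i = w; i < nsegs; i += nworkers)
                    copy_one_file(srcs[i], dsts[i]);
                _exit(0);
            }
        }

        /* Parent: wait for all workers to finish. */
        while (wait(NULL) > 0)
            ;
    }

Whether that helps at all would depend on the storage being able to sustain
the extra concurrent I/O, per the concern above.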

FWIW, we typically have only one database of any significance, but we do use
tablespaces, and I've used pg_upgrade --link since c. v9.0. --jobs probably
helps pg_dump/restore at a few customers who have multiple DBs. But it
probably doesn't help to parallelize --link across tablespaces (since our
tablespaces are actually on the same storage devices, just with different
filesystems). I anticipate it might even make a few customers' upgrades a
bit slower, since --link is a metadata operation and probably involves a lot
of FS barriers, which the storage may be unable to handle in parallel.

--
Justin
