optimize file transfer in pg_upgrade

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: optimize file transfer in pg_upgrade
Date: 2024-11-06 22:07:35
Message-ID: Zyvop-LxLXBLrZil@nathan
Lists: pgsql-hackers

For clusters with many relations, the file transfer step can be the
slowest part of pg_upgrade. This step clones, copies, or links the user
relation files from the old cluster to the new cluster, so the time it
takes is closely tied to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files have the same names in both clusters. Therefore,
it can be much faster to instead move the entire data directory from the
old cluster to the new cluster and then swap in the new cluster's catalog
relation files.
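
To make the idea concrete, here is a minimal sketch in Python (not the C
used by the attached patches). The function name, the directory layout, and
the catalog_files argument are hypothetical; in particular, the sketch
assumes the caller already knows which file names belong to catalog
relations in either cluster, and that all directories live on the same file
system so that rename is cheap.

```python
import os
from pathlib import Path
from typing import Iterable


def catalog_swap(old_dir: Path, new_dir: Path, aside_dir: Path,
                 catalog_files: Iterable[str]) -> None:
    """Move old_dir into new_dir's place, keeping new_dir's catalog files.

    catalog_files is assumed to hold the catalog file names of both
    clusters (hypothetical input, not derived here).
    """
    # 1. Move the new cluster's directory (created by pg_restore) aside.
    os.rename(new_dir, aside_dir)

    # 2. Move the old cluster's directory into place with one rename,
    #    no matter how many user relation files it contains.
    os.rename(old_dir, new_dir)

    # 3. Swap catalog files; only these need per-file work.
    old_catalogs = aside_dir / "old_catalogs"
    old_catalogs.mkdir()
    for name in sorted(catalog_files):
        if (new_dir / name).exists():
            # Old cluster's catalog file: set it aside.
            os.rename(new_dir / name, old_catalogs / name)
        if (aside_dir / name).exists():
            # New cluster's catalog file: move it into place.
            os.rename(aside_dir / name, new_dir / name)
```

The per-file work in this sketch scales with the number of catalog files
rather than the number of user relations, which is the point of the mode.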

The attached proof-of-concept patches implement this "catalog-swap" mode
for demonstration purposes. I tested this mode on a cluster with 200
databases, each with 10,000 tables with 1,000 rows and 2 unique constraints
apiece. Each database also had 10,000 sequences. The test used 96 jobs.

    pg_upgrade --link --sync-method syncfs  -->  10m 23s (~5m linking)
    pg_upgrade --catalog-swap               -->   5m 32s (~30s linking)
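
As a rough reading of these numbers, the arithmetic below separates the
file transfer time from everything else. The transfer figures are the
approximate "~5m" and "~30s" values quoted above, so the derived values are
only estimates.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair to seconds."""
    return 60 * minutes + seconds


link_total = to_seconds(10, 23)     # --link --sync-method syncfs
swap_total = to_seconds(5, 32)      # --catalog-swap
link_transfer = to_seconds(5, 0)    # "~5m linking" (approximate)
swap_transfer = to_seconds(0, 30)   # "~30s linking" (approximate)

# Overall and transfer-only speedups.
overall_speedup = link_total / swap_total
transfer_speedup = link_transfer / swap_transfer

# Time spent outside the transfer step in each run.
link_rest = link_total - link_transfer
swap_rest = swap_total - swap_transfer

print(f"overall speedup:  {overall_speedup:.2f}x")
print(f"transfer speedup: {transfer_speedup:.1f}x")
print(f"non-transfer time: {link_rest}s vs {swap_rest}s")
```

Under these approximations, the time outside the transfer step is about
five minutes in both runs, so nearly all of the gain comes from the
transfer step itself.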

While these results are encouraging, there are a couple of interesting
problems to manage. First, in order to move the data directory from the
old cluster to the new cluster, we must first move the new cluster's
data directory (full of files created by pg_restore) aside. After the file
transfer stage, this set-aside directory will be full of useless empty files that
should eventually be deleted. Furthermore, none of these files will have
been synchronized to disk (outside of whatever the kernel has done in the
background), so pg_upgrade's data synchronization step can take a very long
time, even when syncfs() is used (so long that pg_upgrade can take even
longer than before). After much testing, the best way I've found to deal
with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.
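
The synchronization strategy described above could be sketched roughly as
follows, in Python rather than the C of the patches. All function names are
hypothetical, and the sketch assumes a POSIX system (where a directory can
be opened and fsync'd) and that relation data files live only under the
top-level "base" and "pg_tblspc" directories. It is an illustration of the
approach, not what the attached patches do line for line.

```python
import os
from pathlib import Path
from typing import List, Tuple


def fsync_path(path: Path) -> None:
    """fsync a file or directory (directory fsync assumes POSIX)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def move_and_sync(src: Path, dst: Path) -> None:
    """Move one catalog file into place and make the move durable."""
    fsync_path(src)          # flush the file's contents first
    os.rename(src, dst)      # assumes src and dst share a file system
    fsync_path(dst.parent)   # persist the new directory entry
    fsync_path(src.parent)   # persist removal of the old entry


def sync_except_data_files(
        datadir: Path,
        skip: Tuple[str, ...] = ("base", "pg_tblspc")) -> List[Path]:
    """fsync everything under datadir except the skipped top-level dirs.

    Returns the paths that were synchronized.
    """
    synced: List[Path] = []
    for root, dirs, files in os.walk(datadir):
        root_path = Path(root)
        if root_path == datadir:
            # Assumption: pruning these directories skips all relation
            # data files, which were either synced before the upgrade
            # (old cluster) or synced by move_and_sync (new catalogs).
            dirs[:] = [d for d in dirs if d not in skip]
        for name in files:
            fsync_path(root_path / name)
            synced.append(root_path / name)
        fsync_path(root_path)
        synced.append(root_path)
    return synced
```

The final pass then touches only a small, roughly fixed set of files,
independent of the number of relations.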

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files. Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples. In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files. If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster. Since changes to the sequence tuple format are very
rare, I think the new catalog-swap mode should just use the sequence data
files from the old cluster whenever possible.

There are a couple of other smaller trade-offs with this approach, too.
First, this new mode complicates rollback if, say, the machine loses power
during file transfer. IME the vast majority of failures happen before this
step, and it should be relatively simple to generate a script that will
safely perform the required rollback steps, so I don't think this is a
deal-breaker. Second, this mode leaves around a bunch of files that users
would likely want to clean up at some point. I think the easiest way to
handle this is to just put all these files in the old cluster's data
directory so that the cleanup script generated by pg_upgrade also takes
care of them.
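
One way such a rollback script might be generated is sketched below in
Python. It assumes pg_upgrade keeps an ordered list of the (source,
destination) renames it intends to perform; that list and the function
name are hypothetical and not part of the attached patches. Each generated
step is guarded so that only moves that actually completed before the
failure are undone.

```python
import shlex
from typing import List, Tuple


def rollback_script(moves: List[Tuple[str, str]]) -> str:
    """Build a shell script that undoes the given renames in reverse order.

    moves is the ordered list of (source, destination) renames that the
    upgrade planned to perform (a hypothetical record, see above).
    """
    lines = ["#!/bin/sh", "set -e"]
    for src, dst in reversed(moves):
        q_src, q_dst = shlex.quote(src), shlex.quote(dst)
        # Undo a move only if it happened: the destination exists and
        # the source does not.
        lines.append(
            f"if [ -e {q_dst} ] && [ ! -e {q_src} ]; then "
            f"mv {q_dst} {q_src}; fi"
        )
    return "\n".join(lines) + "\n"
```

Because every step checks the current state first, re-running the script
after a partial rollback would skip the steps already undone.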

Thoughts?

--
nathan

Attachment                                                       Content-Type  Size
v1-0001-Export-walkdir.patch                                     text/plain    1.9 KB
v1-0002-Add-void-arg-parameter-to-walkdir-that-is-passed-.patch  text/plain    8.4 KB
v1-0003-Introduce-catalog-swap-mode-for-pg_upgrade.patch         text/plain    9.5 KB
v1-0004-Add-no-sync-data-files-flag-to-initdb.patch              text/plain    7.3 KB
v1-0005-Export-pre_sync_fname.patch                              text/plain    2.4 KB
v1-0006-In-pg_upgrade-s-catalog-swap-mode-only-sync-files.patch  text/plain    4.4 KB
v1-0007-Add-sequence-data-flag-to-pg_dump.patch                  text/plain    2.8 KB
v1-0008-Avoid-copying-sequence-files-in-pg_upgrade-s-cata.patch  text/plain    3.1 KB
