Improving Physical Backup/Restore within the Low Level API

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Improving Physical Backup/Restore within the Low Level API
Date: 2023-10-16 16:26:47
Message-ID: CAKFQuwbpz4s8XP_+Khsif2eFaC78wpTbNbevUYBmjq-UCeNL7Q@mail.gmail.com
Lists: pgsql-hackers

Hi!

This email is a first pass at a user-visible design for how our backup and
restore process, as enabled by the Low Level API, can be modified to make
it more mistake-proof. In short, it requires pg_backup_start to further
expand upon what it means for the system to be in the midst of a backup,
pg_backup_stop to reverse those things, and the startup process to be
modified to deal with the server having crashed while the system is in that
backup state. Notes at the end extend the design to handle concurrent
backups.

The core functional changes are:
1) pg_backup_start sets a newly added "in backup" state flag in pg_control
to on.
2) pg_backup_stop sets that flag back to off.
3) postmaster will refuse to start if that flag is on, unless one of the
following is true:
   a) crash.signal exists in the data directory
   b) recovery.signal exists in the data directory
   c) standby.signal exists in the data directory
4) Signal file processing causes the in-backup flag in pg_control to be set
to off.
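
To make the four rules above concrete, here is a minimal Python sketch of
the state machine. It is only a model of the proposal, not PostgreSQL code:
the ControlFile class, its in_backup field, and the function names are all
hypothetical, and the data directory is reduced to a set of file names.

```python
from dataclasses import dataclass

# The three signal files named in rule 3.
SIGNAL_FILES = ("crash.signal", "recovery.signal", "standby.signal")


@dataclass
class ControlFile:
    """Hypothetical stand-in for pg_control with the proposed flag."""
    in_backup: bool = False


def backup_start(control: ControlFile) -> None:
    # Rule 1: pg_backup_start turns the flag on.
    control.in_backup = True


def backup_stop(control: ControlFile) -> None:
    # Rule 2: pg_backup_stop turns the flag off.
    control.in_backup = False


def may_start(control: ControlFile, data_dir_files: set) -> bool:
    # Rule 3: refuse to start while the flag is on, unless a signal
    # file is present in the data directory.
    if not control.in_backup:
        return True
    return any(name in data_dir_files for name in SIGNAL_FILES)


def process_signal_file(control: ControlFile) -> None:
    # Rule 4: processing any of the signal files clears the flag.
    control.in_backup = False
```

For example, after backup_start() a call to may_start(control, set())
returns False, while may_start(control, {"recovery.signal"}) returns True.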

The newly added crash.signal file is required to handle the case where the
server crashes after pg_backup_start and before pg_backup_stop. It
initiates a crash recovery of the instance just as is done today but with
the added change of flipping the flag to off when recovery is complete just
before going live.

The error message for the failed startup while in backup will tell the DBA
that one of the three signal files must exist. When processing
recovery.signal or standby.signal, the presence of the backup_label and
tablespace_map files is mandatory, and the system will also fail to start
should they be missing.
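
A rough sketch of that startup check, again in Python as a model only. The
function name and the message wording are made up for illustration, and I
am assuming the backup_label/tablespace_map requirement is checked only
when the in-backup flag is on; the text above does not pin that down.

```python
SIGNAL_FILES = ("crash.signal", "recovery.signal", "standby.signal")
REQUIRED_FOR_RESTORE = ("backup_label", "tablespace_map")


def startup_error(in_backup: bool, files: set):
    """Return an error message, or None if startup may proceed.

    `files` is the set of file names at the root of the data directory.
    """
    if not in_backup:
        # Assumption: the new checks apply only while the flag is on.
        return None

    if not any(name in files for name in SIGNAL_FILES):
        return ("cluster was in backup mode: one of "
                + ", ".join(SIGNAL_FILES) + " must exist")

    # recovery.signal and standby.signal additionally require the
    # backup_label and tablespace_map files.
    if "recovery.signal" in files or "standby.signal" in files:
        missing = [f for f in REQUIRED_FOR_RESTORE if f not in files]
        if missing:
            return "missing required file(s): " + ", ".join(missing)

    return None
```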

For non-functional changes I would also suggest doing the following:
pg_backup_start will create a "pg_backup_metadata" directory if there is
not already one, or will empty it if there is.
pg_backup_start will create a crash.signal file in that directory.
pg_backup_stop will create files within pg_backup_metadata upon its
completion:
  backup_label
  tablespace_map
  recovery.signal
  standby.signal
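
The directory lifecycle just described could look roughly like the
following. This is a filesystem-only sketch: start_metadata and
stop_metadata are hypothetical names standing in for the file handling
inside pg_backup_start and pg_backup_stop, and the file contents are passed
in as plain strings rather than generated by the server.

```python
import shutil
from pathlib import Path

METADATA_DIR = "pg_backup_metadata"


def start_metadata(data_dir: Path) -> Path:
    """File handling proposed for pg_backup_start."""
    meta = data_dir / METADATA_DIR
    if meta.exists():
        shutil.rmtree(meta)  # empty the directory if it already exists
    meta.mkdir()
    (meta / "crash.signal").touch()
    return meta


def stop_metadata(data_dir: Path, backup_label: str,
                  tablespace_map: str) -> None:
    """File handling proposed for pg_backup_stop."""
    meta = data_dir / METADATA_DIR
    (meta / "backup_label").write_text(backup_label)
    (meta / "tablespace_map").write_text(tablespace_map)
    # The signal files are empty markers; the server writes them so the
    # user never has to create them by hand.
    (meta / "recovery.signal").touch()
    (meta / "standby.signal").touch()
```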

All of the instructions regarding what to place in those files should be
removed and instead the system should write them - no copy-paste.

The instructions would be modified to say "copy the backup_label and
tablespace_map files to the root of the backup directory and the recovery
and standby signal files to the pg_backup_metadata directory in the
backup". Additionally, we document crash recovery by saying "move
crash.signal from pg_backup_metadata to the root of the data directory". We
should explicitly advise excluding or removing
pg_backup_metadata/crash.signal from the backup as well.
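
As an illustration of what a backup tool would do with those instructions,
here is a small Python sketch. The function names are hypothetical, and it
assumes the files are taken from the live data directory's
pg_backup_metadata after pg_backup_stop has returned.

```python
import shutil
from pathlib import Path

METADATA_DIR = "pg_backup_metadata"


def finalize_backup(data_dir: Path, backup_dir: Path) -> None:
    """Place the server-written files into the finished backup."""
    src = data_dir / METADATA_DIR
    dst_meta = backup_dir / METADATA_DIR
    dst_meta.mkdir(parents=True, exist_ok=True)

    # backup_label and tablespace_map go to the root of the backup.
    for name in ("backup_label", "tablespace_map"):
        shutil.copy2(src / name, backup_dir / name)

    # Both signal files stay under pg_backup_metadata in the backup.
    for name in ("recovery.signal", "standby.signal"):
        shutil.copy2(src / name, dst_meta / name)

    # crash.signal must not travel with the backup.
    (dst_meta / "crash.signal").unlink(missing_ok=True)


def request_crash_recovery(data_dir: Path) -> None:
    """Documented crash-recovery step: move crash.signal to the root."""
    shutil.move(str(data_dir / METADATA_DIR / "crash.signal"),
                str(data_dir / "crash.signal"))
```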

Extending the above to handle concurrent backups: for pg_control we'd still
use the on/off flag, but we have to have a shared in-memory session lock on
something so that only the last surviving process actually changes it to
off, while also dealing with sessions that terminate without issuing
pg_backup_stop and without the server itself crashing. (I'm unfamiliar with
how this is handled today, but I presume a mechanism exists already that
just needs to be extended.)
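
The "last one out turns off the flag" idea can be modeled with a lock and a
set of active sessions. This is a toy Python sketch under my own
assumptions: BackupState and its methods are hypothetical, threads stand in
for backend processes, and it says nothing about how the server tracks
running backups today.

```python
import threading


class BackupState:
    """Toy model of shared backup state guarding one on/off flag."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = set()      # sessions currently in a backup
        self.in_backup = False    # stands in for the pg_control flag

    def start(self, session_id: int) -> None:
        with self._lock:
            self._active.add(session_id)
            self.in_backup = True

    def stop(self, session_id: int) -> None:
        with self._lock:
            self._active.discard(session_id)
            # Only the last surviving session flips the flag to off.
            if not self._active:
                self.in_backup = False

    def session_terminated(self, session_id: int) -> None:
        # A session that exits without calling stop is treated the
        # same way, so the flag cannot be left on by accident.
        self.stop(session_id)
```

With two sessions started, stopping the first leaves in_backup on; the flag
goes off only when the second one stops or terminates.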

For the non-functional stuff, pg_backup_start returns a process id, and
subdirectories under pg_backup_metadata are created, each named with that
id. Add a pg_backup_cleanup() function that executes while not in backup
mode to clean up those subdirectories. Any subdirectory in the backup that
isn't named for the process id returned by pg_backup_start should be
excluded/removed.
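
On the backup side, the pruning rule might be sketched like this in Python.
The function name prune_other_backups is hypothetical, and it assumes each
subdirectory of pg_backup_metadata is named with the decimal process id
returned by pg_backup_start.

```python
import shutil
from pathlib import Path


def prune_other_backups(backup_dir: Path, backup_id: int) -> list:
    """Remove metadata subdirectories belonging to other backups.

    Returns the names of the subdirectories that were removed.
    """
    meta = backup_dir / "pg_backup_metadata"
    removed = []
    for entry in sorted(meta.iterdir()):
        # Keep only the subdirectory named for this backup's id.
        if entry.is_dir() and entry.name != str(backup_id):
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```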

David J.
