PITR Archive Recovery, 8 July 2004

What's in this patch?

- All code from previous patches: working now against CVS tip as of now
- Partially complete code for PITR, though not enabled as yet.

This is a full, working patch for discussion and testing, while I complete the additional code.

Archive Recovery Overview

The rule is: when you startup with archive_mode on, take a backup.

This patch provides an additional mode for transactional log recovery.
The current mode provides Crash Recovery, allowing the database to
restart and to read the last few online transaction logs to recover. The
additional mode is known as Archive Recovery, enabled in postgresql.conf
using archive_mode=true.

The administrator also specifies and archive location and a program to
perform the copying (discussed later). As database writes occur and
transaction logs fill they are copied to the archive location by a new
sub-process of the postmaster, called the archiver process.

If the database crashes when in archive_mode, it will recover just as
before. If a major system crash, or other reason to restore occurs, then
the new Archive Recovery mode can be invoked by:
- restoring the full physical backup
- creating DataDir/recovery.conf
- restart PostgreSQL

PostgreSQL will recover using almost the same code path as before,
except: all transaction logs will be restored for use directly from
archive in a stream that prevents disk space from overflowing. Extended
recovery is possible, over many transaction logs - far more than was
previously possible.

The essential aspect to recovery is maintaining a solid backup/log
chain. This must consist of:
i) a full backup of every file in PostgreSQL - miss none!
ii) archived transaction logs from just before backup taken to whenever
you want to recover to...

If you "break the chain" i.e. the two sets of files don't match, or
you've deleted a transaction log or lost a tape - then you will only be
able to recover up to the point the chain broke. The longer the chain,
the easier it will be to break, so backup regularly and work out a
robust archiving policy.

Extensibility
Archive Recovery allows the administrator to specify a program or script
to use when both archiving and restoring transaction log files. As a
result, the administrator can choose to store transaction log files on
any drive, any other system or integrating with any Backup Archive
Recovery (BAR) software.

The Archive Program is provided with 3 parameters, which can be placed
anywhere within the supplied command string. Each parameter is
represented by a "%s" character string.
- The first parameter is replaced with the full path of the transaction
log to be archived.
- The second parameter is replaced with the value of the archive_dest
parameter
e.g. with these settings
archive_mode = true
archive_dest = '/mount/disk2/pgarchive'
archive_program = 'cp %s %s'
with transaction log 00000000000003A4 to archive
with DataDir of web123

would execute the following command
'cp /usr/local/pgsql/web123/00000000000003A4 /disk2/pgarchive'

To enter Archive Recovery the administrator must create a recovery.conf
file. [Currently an empty file will do, though intended as the one place
where all recovery options and parameters would be set]

The Restore Program is provided with 2 parameters
- The first parameter is replaced with the value of the archive_dest
parameter
- The second parameter is replaced with the value of the transaction log
which recovery is requesting should now be restored
- The third parameter is replaced with the full path of the filename
which should be the target of the single file recovery operation.
[In this patch, XlogArchRestoreProgram is hardcoded to "cp %s/%s %s",
pending short discussion on how to specify this...]

Possibilities
Using the current features, it is possible to implement an Automated
Standby Database. This is an active-passive High Availability option. In
this mode, the main server sends archived log files to a second, standby
server. The standby server is set up to be in "permanent recovery", by
using a RestoreProgram that waits for each file to be shipped to it. The
standby system receives each file, then recovers up to that point - so
the standby system is always a few seconds from completing its startup
should it be required.

Patch Status & Current Caveats
- A number of recovery scenarios have been tested and the patch is
believed to be stable and ready for others to begin commentary and
testing...please understand that there are many scenarios that work and
many that do not...these last are not bugs

- Recovery to a specific point in time is not yet implemented. The
administrator has the following recovery options:
--can recover the system fully, all the way to the end of logs
--stop recovery by withholding log files at an appropriate place,
thereby forcing the termination of recovery

How it Works

archive_debug is a hidden postgresql.conf setting which can be used to
show more debug output for the archive recovery facilities.
...archive_debug = true

[This currently generates an additional log file called recovery.log,
showing the record headers and types of commit/abort xlog records.]

Each PostgreSQL user has a corresponding backend process performing work
for them. These backends write any transaction log data to the log as
each operation occurs, then marks transactions as either committed or
aborted. The log is split into log segment files. When each log file
fills a notification file is created in archive_status, and at the same
time a signal is sent to the archiver (via the postmaster) to begin
archiving.

The archiver executes the administrator's program or script using a
system(3) call, archiving each file one at a time.

The notification file provides a number of features. First, its
simplicity ensures that we don't need a shared memory link between each
backend and the archiver, which would then be prone to failure - which
is exactly not what we want. Second, the archive_status file is written
whether or not the postmaster is still up, so will still function
correctly even when the worst has happened. Third, as a persistent
record, it will allow archiving to restart at the same point later on
(even after a backup/restore), allowing it to recover gracefully from
administrator's immediate shutdown requests as well as postmaster or
archive system failures.

When recovery completes, archive_status files are written to ensure
cleanup of all transaction logs leaves the database in a fit state for
production. The recovery.conf file is removed to ensure that a
subsequent crash doesn't accidentally begin recovery from archives...

The archiver process starts only when recovery has fully completed.

How to Fail

All of the following ways to fail have already occurred in testing...

1. if you do all 3 of these, you will not have a backup/log chain:
a) use the 'cp' command, or any other command that can write to either a
file or a directory
b) use archive_dest to specify a directory
c) forget to create the directory, which the command in a) then
interprets as a file and then all archived logs overwrite each other and
break the chain. You will be unable to rollforward AT ALL.

2. if you do all 5 of these, you will not rollforward very far, if any.
a) restore the database backup
b) forget to specify a recovery.conf file
c) startup and let database recover using only the xlogs that were
present when the backup was taken
d) then immediately after recovery run a job which causes xlog to be
written, and then a file to fill
e) use an archive command that allows overwriting of previously archived
files
This sequence causes the database to recover successfully, but not using
your full archive chain. The current redo pointer points half way
through your backup/log chain, so when the file is archived, it
overwrites a file in the middle of the chain, thus breaking the chain.
There isn't an option on cp that will prevent this...

3. turn archive_mode on and off...
Turning archive_mode off means that when an xlog fills, it won't EVER
get archived. This will break the backup/log chain. If you turn
archive_mode back on, and an xlog filled, then it will not be picked up
as having filled and you will be missing a link in the chain. Sometimes
you'll get away with it and then you'll think it will work all the time.
The rule is: when you startup with archive_mode on, take a backup.

4. Forget that xlogs are the same size, whether they are full or
not...you can't tell (yet) by looking at one whether it has a record
within it that will break the chain

How to succeed
- Test your recovery procedures before you need them
- When you're in a recovery situation backup everything you can lay your
hands on, to make sure you have a known position to return to while you
ATTEMPT recovery - you may make a mistake and need to retry
- be cool - we all make mistakes, just don't let errors multiply

- disable listen_addresses in postgresql.conf when you recover, to give
yourself some breathing space to check things, when you finally do get
the database ready
- recover with archive_debug = true to give me/others a chance to debug
any problems or answer any questions you may have

The rule is: when you startup with archive_mode on, take a backup.

When enabling archive_mode for the first time, if you have run
previously without enabling archive_mode, then when you will inevitably
get messages saying "cannot find archive_status file". After you have
taken a backup, you can manually create files in the archive_status
directory of <xlog>.done, which will then allow the bgwriter to clean
them up when it next takes a checkpoint.

Additional Work

- recovery.conf needs some minor work to specify recover options. It is
envisaged that this would be a short bison grammar, very very similar to
postgresql.conf

- It seems possible to easily work around the requirement to take the
backup while the database is open. It would be a good option to have
both hot and cold backup options...

- full documentation will also follow shortly

Credits

This work is the final push in a long series of patches and discussions
about how to achieve archive recovery with PostgreSQL. The work of
J.R.Nield and Patrick MacDonald has provided the detailed underpinnings
for this recent work, which in turn rests upon Vadim Mikheev's original
work on WAL and MVCC. Tom Lane has provided considerable technical
assistance and quality review, whilst Bruce Momjian has provided many of
the ideas and smoothed the way for much of the work. Thanks, all.

Simon Riggs, simon@2ndquadrant.com