We're considering a change to our backup strategy to reduce WAN
bandwidth and archival storage space. I'm posting a description here
both because it may be of value to someone else, and because someone
might be able to poke a hole in it before we go ahead with
implementation. All servers are Linux (SuSE Enterprise 10).
Besides the PITR backups, we have some additional safety nets, which
I'll describe up front. These are not changing.
We have 72 counties which each have a database server holding the
official record for that county. Our application framework passes
transaction images captured by our middle tier on each county database
server to a process which replicates to four central databases, each of
which holds a copy of all counties. This process detects idle time on
the servers and uses it to run a sync process to confirm that the
central copies contain accurate replicas of the counties, reporting any
differences and correcting the central databases to match the counties.
It also saves the transaction images to two transaction repositories
where they are kept for a little over a year.
Our current PostgreSQL backup strategy is:
(1) Our archive script copies WAL files to a directory on the database
server, using cp to a one directory followed by mv to another (to
prevent partial files from being processed).
(2) We have a crontab job on each county database server to push WAL
files to a backup server in the county using rsync.
(3) We have a crontab job on our central backup server to pull the WAL
files from the county database servers using rsync.
(4) We do a PITR base backup once per week on each county database
server, using cpio piped to gzip and split, saving the backup on the
database server.
(5) We have a crontab job on each county database server to push base
backup files to a backup server in the county using rsync.
(6) We have a crontab job on our central backup server to pull the
base backup files from the county database servers using rsync.
(7) We have one warm standby per county on our central backup server
which applies the WAL files as they arrive. These are monitored to
ensure that they are running, in recovery mode, and are reasonably
current -- to detect problems with transfer of the WAL files.
(8) We have a crontab job which notices the arrival of a new base
backup, and uses it for a new basis for the related warm standby,
staring that automatically in place of the prior one.
(9) We have crontab scripts which detect when we have received a new
base backup from a county which has successfully come up in warm standby
mode and is the first base backup for that county for the month. It
captures that base backup, and the WAL files required to start it (based
on the .backup file) to a mirrored SAN archive volume, for one year
retention.
This runs well with little human intervention, but we're having
problems with both the WAN network bandwidth and the space needed for
the monthly archives. Our solution is to abandon compression, instead
using hard links and rsync. The steps involving WAL files would not
change, but most things involving PITR base backups would do a recursive
hard link copy of a the previous backup, followed by an rsync from the
new backup image. We should get our WAN traffic for base backups down
to a fraction of its current volume, and the hard links look to save
considerably more space than the gzip compression.
For (4) we will use 'cp -rl' to copy the previous backup to a new
directory, then rsync (with --delete
--include='/pg_xlog/archive_status/' --exclude='/pg_xlog/*') from the
production database to the hard link copy.
For (5) we'll do a hard link copy on the backup server before the rsync
of the directory tree for the base backup.
For (6) we will do a hard link recursive copy from the previous backup,
rsync from the warm standby, then rsync from the county backup. We
expect to be bringing back a pretty minimal delta change set based on
the differences between the warm standby and the recently completed
backup of the county database, especially since these will normally be
running off-hours on a weekend.
For (8) we will do a full copy (no links) from the new backup image to
get the starting point for the new warm standby.
For (9) we will do a recursive hard link copy of the previous month's
base backup and rsync the new month's backup onto it. We realize that
if one of the linked files is damaged, it can affect some or all
archival backups for that county. That has been deemed an acceptable
risk.
Comments?
-Kevin