Problem in Postgresql DB backup setup

From: Ujjawal Misra <ujjawal(at)verse(dot)in>
To: pgsql-general(at)postgresql(dot)org
Subject: Problem in Postgresql DB backup setup
Date: 2009-05-21 08:57:35
Message-ID: d9fe24a80905210157q5aa50d92me46c32f5185513f7@mail.gmail.com
Lists: pgsql-general

Hi,

We are facing a problem in setting up a database backup for our production
server using pg_standby.

We are using Postgres 8.3.5 in our production environment. We have an active
user-base of 2 million and need to support 5000 TPS. The service needs to be
up and running 24x7. The production database has a size of 200GB and growing
and is expected to reach 1TB in a couple of months.

In the past we have tried various strategies for db backup, including pg_dump
and a warm backup using WAL shipping.
But our current needs demand a hot backup which can be set up without
stopping the database server. For this we have evaluated pg_standby with WAL
shipping in a test environment that has continuous db inserts. The set-up
works fine and the slave (pg_standby) is able to consume the WAL logs
generated by the master.

We tried to replicate the scenario of the master db going down in this test
environment in the following fashions:
a. Killing the master postgres server (using kill command).
b. Properly stopping the master postgres server (using pg_ctl command).
In both of the above cases we saw that after the slave has consumed all the
generated WAL logs and the recovery completes, the slave lags the master by
a set of records. Moreover, the missing records are not a consecutive set:
when the master is brought back up and compared with the slave, a seemingly
random (non-consecutive) set of records is missing on the slave.

Following are the steps that we have used for creating the pg_standby slave:
1. SLAVE: Ensure that postgres is not running on the slave. Stop it if it is
running.
Create and INIT a new data directory on the slave
(/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/test_data).
2. SLAVE: Ensure that the slave can pull wal-logs from the MASTER. Authorize
the SLAVE on the MASTER if needed.
3. MASTER: The archive command should be enabled on the master server.
[/usr/local/pgsql/data/postgresql.conf]
===================================
archive_mode = on # allows archiving to be done
# (change requires restart)
archive_command = '/usr/local/pgsql/bin/verseSQLs/storeWAL.sh %p %f
< /dev/null'
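The storeWAL.sh script itself is not shown; purely as an illustration, a minimal sketch of what such an archive script might do (the flag-file mechanism, the paths, and the function name are assumptions, not the actual script):

```shell
#!/bin/sh
# Hypothetical sketch of an archive script invoked as: storeWAL.sh %p %f
# The flag file, archive directory, and function name are assumptions.

archive_wal() {
    wal_path="$1"     # %p: path of the WAL segment, relative to the data dir
    wal_file="$2"     # %f: bare file name of the WAL segment
    archive_dir="$3"  # shared storage that the slave pulls from
    flag_file="$4"    # toggled by backup_rsync.sh (steps 4 and 6a)

    # WAL storage off: report success so the master recycles the segment
    [ -f "$flag_file" ] || return 0

    # Never overwrite an already-archived segment
    [ ! -f "$archive_dir/$wal_file" ] || return 1

    # Copy to a temporary name, then rename, so the slave never
    # sees a partially written segment
    cp "$wal_path" "$archive_dir/.$wal_file.tmp" &&
        mv "$archive_dir/.$wal_file.tmp" "$archive_dir/$wal_file"
}

# archive_command would call it with fixed locations, e.g.:
# archive_wal "$1" "$2" /shared/wal_archive /usr/local/pgsql/wal_storage.on
```

Note that returning success while storage is disabled lets the master recycle segments instead of letting pg_xlog grow while it retries.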
4. MASTER: Ensure that the storeWAL.sh script has WAL storage off. This would
automatically be turned on by the backup_rsync.sh script (see step 6).
5. SLAVE: WAL consumption should be enabled on the slave server's cron.
The consume_wals.sh script pulls the set of WALs from the shared storage on
the master, so that these WALs can later be consumed on the slave.
#-------------------------------------------------------------------
# [CONSUME WALs from MASTER : VERY CRUCIAL]
#-------------------------------------------------------------------
* * * * * cd /usr/local/pgsql/bin/verseSQLs/consume_wal_scripts &&
./consume_wals.sh
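Likewise, consume_wals.sh is not shown; a sketch of the pull logic with a lock directory so a slow run and the next cron invocation cannot overlap (a local cp stands in for the network pull from the master; all names and paths are assumptions):

```shell
#!/bin/sh
# Hypothetical sketch of consume_wals.sh, run every minute from cron.
# mkdir is atomic, so a lock directory doubles as a cheap mutex.

consume_wals() {
    src="$1"    # master's shared WAL storage (an rsync source in reality)
    dst="$2"    # local directory that the standby's restore_command reads
    lock="$3"   # lock directory guarding against overlapping cron runs

    mkdir "$lock" 2>/dev/null || return 0   # previous run still active; skip

    # The real script would pull over the network, e.g.:
    #   rsync -a "postgres@MASTER:$src/" "$dst/"
    cp "$src"/* "$dst"/ 2>/dev/null || true

    rmdir "$lock"
}
```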
6. MASTER: Run backup_rsync.sh.
It does the following:
a. Enables WAL storage in the storeWAL.sh script.
b. Issues a pg_start_backup() to postgres on master.
c. Changes to the data dir of master and rsyncs the contents of the data
directory of the master to the slave:
rsync -avz --delete --exclude='backup_label' --exclude='.*'
--exclude='*.pid' --exclude='*.conf' --exclude='pg_xlog' --exclude='pg_tblspc'
--exclude='pg_log/*' --copy-unsafe-links . postgres@
<SLAVE_IP>:/usr/local/pgsql/test_data
d. Issues a pg_stop_backup() to postgres on master.
e. Cleans pg_xlog on slave.
7. SLAVE: Copy recovery.conf to the data dir. Start POSTGRES server. It
starts in STANDBY mode.
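For reference, the recovery.conf for step 7 would be along these lines (the archive path, trigger file, sleep interval, and log location here are assumptions; on 8.3, pg_standby is a contrib module):

```
[/usr/local/pgsql/test_data/recovery.conf]
===================================
restore_command = '/usr/local/pgsql/bin/pg_standby -d -s 2 -t /tmp/pgsql.trigger /shared/wal_archive %f %p %r 2>> /var/log/pg_standby.log'
```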

The above steps set up the slave server using pg_standby. Note that during
all this process the master db is up and there are continuous batch INSERTs
happening in the db by an automated process. Also note that this automated
process is the only process interacting with the db and there are no other
CRUD operations besides these batch INSERTs.

Further, in order to test the scenario of the master going down, follow
these steps:
8. MASTER: If you want to recreate the scenario of the master db going down,
either kill the postgres process or stop the postgres server using pg_ctl.
9. MASTER: Disable the automated process that generates continuous batches
of INSERTs in the db.
10. SLAVE: Once you have verified that there are no unconsumed WAL logs
left, signal pg_standby that the recovery has completed (touch the trigger
file expected by the pg_standby command specified in recovery.conf).
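Step 10 can be sketched as a small guard on the slave that refuses to create the trigger file while unconsumed segments remain (the paths and the function name are assumptions):

```shell
#!/bin/sh
# Hypothetical failover helper for the slave: only create the trigger file
# once the local WAL archive directory has been fully consumed.

trigger_failover() {
    archive="$1"   # directory pg_standby restores segments from
    trigger="$2"   # trigger file named by pg_standby's -t option

    # Refuse to trigger while anything is left to consume
    [ -z "$(ls -A "$archive")" ] || return 1

    touch "$trigger"
}
```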
11. SLAVE: Once pg_standby sees the trigger file, the recovery completes and
the postgres server comes up in production mode.
12. MASTER: Ensure that the storeWAL.sh script has WAL storage off. Though by
now the slave is in a non-recovery mode, this step prevents the unnecessary
consumption of space by WAL storage.
13. MASTER: Query the count of the records produced by the automated process
on the master.
14. SLAVE: Query the count of the records as present on the slave.
Compare the results of 13 and 14. We observe a difference here.

We would appreciate it if you could help us figure out the cause of the
difference in the record sets and how it can be avoided.

Thanks and regards,
Ujjawal
