Re: After upgrade to 9.3, streaming replication fails to start

From: Jeff Ross <jross(at)wykids(dot)org>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: After upgrade to 9.3, streaming replication fails to start
Date: 2013-11-06 19:26:08
Message-ID: 527A97D0.1030108@wykids.org
Lists: pgsql-general


On 11/6/13, 11:32 AM, Jeff Janes wrote:
> On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross(at)wykids(dot)org> wrote:
>
>
> _postgresql(at)nirvana:/var/postgresql $ cat start_hot_standby.sh
> #!/bin/sh
> backup_label=wykids_`date +%Y-%m-%d`
> #remove any existing wal files on the standby
> ssh dukkha.internal rm -rf /wal/*
> #stop the standby server if it is running
> ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
> psql -c "select pg_start_backup('$backup_label');" template1
> rsync \
> --copy-links \
> --delete \
> --exclude=backup_label \
>
>
>
> Excluding backup_label is exactly the wrong thing to do. The only
> reason backup_label is created in the first place is so that it can be
> copied to the replica, where it is needed. Its existence on the
> master is a nuisance.
>
>
> --exclude=postgresql.conf \
> --exclude=recovery.done \
> -e ssh -avz /var/postgresql/data.93.5432/ \
> dukkha.internal:/var/postgresql/data.93.5432/
> ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
> ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*
> ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
> ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
> ssh dukkha.internal ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf
> psql -c "select pg_stop_backup();" template1
> ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432
>
>
> _postgresql(at)nirvana:/var/postgresql $ sh -x start_hot_standby.sh
> + date +%Y-%m-%d
> + backup_label=wykids_2013-11-06
> + ssh dukkha.internal rm -rf /wal/*
> + ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
> + rsync -e ssh /wal/ dukkha.internal:/wal/
> skipping directory .
>
>
>
> Where is the above rsync coming from? It doesn't seem to be in the
> shell script you showed.
>
> Anyway, I think you need to copy the wal over after you call
> pg_stop_backup, not before you call pg_start_backup.
>
> Cheers,
>
> Jeff

Hi Jeff,

Thanks for the reply. Oops, I pasted one of the many revisions of the
script, but not the one with the rsync that copies /wal from the primary
to the standby.

I should have mentioned that WAL archiving is set up and working from the
primary to the standby. It saves WAL both locally on the primary and
remotely on the standby.

I moved the rsync line that copies WAL from the primary to the standby so
that it runs after pg_stop_backup, but I'm still getting the same panic on
the standby.

Here's the real, honest version of the script I use to start the hot
standby:

_postgresql(at)nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the secondary
ssh dukkha.internal "rm -rf /wal/*"
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
--copy-links \
--delete \
--exclude=backup_label \
--exclude=postgresql.conf \
--exclude=recovery.done \
-e ssh -avz /var/postgresql/data.93.5432/ \
dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"
ssh dukkha.internal "ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf"
psql -c "select pg_stop_backup();" template1
rsync -e ssh -avz /wal/ dukkha.internal:/wal/
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432
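
For comparison, the ordering suggested in the quoted reply could be
sketched roughly as follows. This is a dry run only, under assumptions:
the run helper and the base_backup_steps function are hypothetical names
that are not part of the script above, and the commands are printed
rather than executed. The sketch drops --exclude=backup_label so the file
reaches the standby, and copies /wal only after pg_stop_backup. The
cleanup and service steps from the script above are left out for brevity.

```shell
#!/bin/sh
# Dry-run sketch (hypothetical helper names): print each step instead of
# running it. echo drops the shell quoting, so the output is for reading,
# not for pasting back into a shell.
run() { echo "$*"; }

standby=dukkha.internal
datadir=/var/postgresql/data.93.5432

# base_backup_steps LABEL: print the base backup steps in order.
base_backup_steps() {
    run psql -c "select pg_start_backup('$1');" template1
    # backup_label is deliberately NOT excluded here.
    run rsync --copy-links --delete \
        --exclude=postgresql.conf --exclude=recovery.done \
        -e ssh -avz "$datadir/" "$standby:$datadir/"
    run psql -c "select pg_stop_backup();" template1
    # Archived WAL is copied only after the backup has been stopped.
    run rsync -e ssh -avz /wal/ "$standby:/wal/"
}
```

Calling base_backup_steps "wykids_`date +%Y-%m-%d`" prints the four
commands in the order they would run.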

Here are the logs on the standby after running the above:

2013-11-06 11:56:30.792461500 <%> LOG: database system was interrupted; last known up at 2013-11-06 11:52:22 MST
2013-11-06 11:56:30.800685500 <%> LOG: entering standby mode
2013-11-06 11:56:30.800891500 <%> LOG: invalid primary checkpoint record
2013-11-06 11:56:30.800930500 <%> LOG: invalid secondary checkpoint record
2013-11-06 11:56:30.801004500 <%> PANIC: could not locate a valid checkpoint record
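
Going by the quoted reply, one thing worth checking before starting the
standby is whether backup_label actually arrived in its data directory,
since that file tells recovery which checkpoint to start from. A minimal
sketch of such a check, assuming a hypothetical helper named
has_backup_label that is not part of the script above:

```shell
#!/bin/sh
# has_backup_label DIR: succeed if DIR contains a regular file named
# backup_label, fail otherwise. (Hypothetical helper, for illustration.)
has_backup_label() {
    [ -f "$1/backup_label" ]
}
```

Run on the standby, this might look like
has_backup_label /var/postgresql/data.93.5432 || echo "backup_label missing".
This only confirms that the file is present; it does not prove that a
missing file is the cause of the PANIC above.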

Jeff
