Re: After upgrade to 9.3, streaming replication fails to start--SOLVED

From: Jeff Ross <jross(at)wykids(dot)org>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: After upgrade to 9.3, streaming replication fails to start--SOLVED
Date: 2013-11-07 16:29:23
Message-ID: 527BBFE3.3050704@wykids.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 11/6/13, 12:26 PM, Jeff Ross wrote:
>
> On 11/6/13, 11:32 AM, Jeff Janes wrote:
>> On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross(at)wykids(dot)org
>> <mailto:jross(at)wykids(dot)org>> wrote:
>>
>>
>> _postgresql(at)nirvana:/var/postgresql $ cat start_hot_standby.sh
>> #!/bin/sh
>> backup_label=wykids_`date +%Y-%m-%d`
>> #remove any existing wal files on the standby
>> ssh dukkha.internal rm -rf /wal/*
>> #stop the standby server if it is running
>> ssh dukkha.internal sudo /usr/local/bin/svc -d
>> /service/postgresql.5432
>> psql -c "select pg_start_backup('$backup_label');" template1
>> rsync \
>> --copy-links \
>> --delete \
>> --exclude=backup_label \
>>
>>
>>
>> Excluding backup_label is exactly the wrong thing to do. The only
>> reason backup_label is created in the first place is so that it can
>> be copied to the replica, where it is needed. It's existence on the
>> master is a nuisance.
>>
>>
>> --exclude=postgresql.conf \
>> --exclude=recovery.done \
>> -e ssh -avz /var/postgresql/data.93.5432/ \
>> dukkha.internal:/var/postgresql/data.93.5432/
>> ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
>> ssh dukkha.internal rm -f
>> /var/postgresql/data.93.5432/pg_xlog/archive_status/*
>> ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
>> ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
>> ssh dukkha.internal ln -s /var/postgresql/recovery.conf
>> /var/postgresql/data.93.5432/recovery.conf
>> psql -c "select pg_stop_backup();" template1
>> ssh dukkha.internal sudo /usr/local/bin/svc -u
>> /service/postgresql.5432
>>
>>
>> _postgresql(at)nirvana:/var/postgresql $ sh -x start_hot_standby.sh
>> + date +%Y-%m-%d
>> + backup_label=wykids_2013-11-06
>> + ssh dukkha.internal rm -rf /wal/*
>> + ssh dukkha.internal sudo /usr/local/bin/svc -d
>> /service/postgresql.5432
>> + rsync -e ssh /wal/ dukkha.internal:/wal/
>> skipping directory .
>>
>>
>>
>> Where is the above rsync coming from? It doesn't seem to be in the
>> shell script you showed.
>>
>> Anyway, I think you need to copy the wal over after you call
>> pg_stop_backup, not before you call pg_start_backup.
>>
>> Cheers,
>>
>> Jeff
>
> Hi Jeff,
>
> Thanks for the reply. Oops, I copied one of the many changes to the
> script, but not the one with the rsync to copy /wal from the primary
> to the standby.
>
> I should have mentioned that wal archiving is setup and working from
> the primary to the standby. It saves wal both on the locally on the
> primary and remotesly on the standby.
>
> I moved the rsync line to copy wal from primary to secondary after
> pg_stop_backup but I'm still getting the same panic on the standby.
>
> Here's the real, honest version of the script I use to start the hot
> standby:
>
> _postgresql(at)nirvana:/var/postgresql $ cat start_hot_standby.sh
> #!/bin/sh
> backup_label=wykids_`date +%Y-%m-%d`
> #remove any existing wal files on the secondary
> ssh dukkha.internal "rm -rf /wal/*"
> ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
> psql -c "select pg_start_backup('$backup_label');" template1
> rsync \
> --copy-links \
> --delete \
> --exclude=backup_label \
> --exclude=postgresql.conf \
> --exclude=recovery.done \
> -e ssh -avz /var/postgresql/data.93.5432/ \
> dukkha.internal:/var/postgresql/data.93.5432/
> ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"
> ssh dukkha.internal "rm -f
> /var/postgresql/data.93.5432/pg_xlog/archive_status/*"
> ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"
> ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"
> ssh dukkha.internal "ln -s /var/postgresql/recovery.conf
> /var/postgresql/data.93.5432/recovery.conf"
> psql -c "select pg_stop_backup();" template1
> rsync -e ssh -avz /wal/ dukkha.internal:/wal/
> ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432
>
> Here are the logs on the standby after running the above:
>
> 2013-11-06 11:56:30.792461500 <%> LOG: database system was
> interrupted; last known up at 2013-11-06 11:52:22 MST
> 2013-11-06 11:56:30.800685500 <%> LOG: entering standby mode
> 2013-11-06 11:56:30.800891500 <%> LOG: invalid primary checkpoint record
> 2013-11-06 11:56:30.800930500 <%> LOG: invalid secondary checkpoint
> record
> 2013-11-06 11:56:30.801004500 <%> PANIC: could not locate a valid
> checkpoint record
>
> Jeff
My apologies to Jeff--I'd missed his in-line comment above that I should
*not* exclude the backup label from the rsync of the primary to the
standby. As soon as I removed that exclusion and with his other
suggested change that I should copy the /wal from the primary to the
standby after pg_stop_backup, streaming replication started on the
standby exactly as it should.

Logs from the standby:

2013-11-07 09:21:15.273712500 <%> LOG: database system was interrupted;
last known up at 2013-11-07 09:16:05 MST
2013-11-07 09:21:15.286834500 <%> LOG: entering standby mode
2013-11-07 09:21:16.873654500 <%> LOG: restored log file
"000000010000000000000050" from archive
2013-11-07 09:21:16.936355500 <%> LOG: redo starts at 0/50000024
2013-11-07 09:21:17.129718500 <%> LOG: consistent recovery state
reached at 0/50036D6C
2013-11-07 09:21:17.131933500 <%> LOG: database system is ready to
accept read only connections
2013-11-07 09:21:17.136856500 cp: /wal/000000010000000000000051: No such
file or directory
2013-11-07 09:21:17.194811500 <%> LOG: started streaming WAL from
primary at 0/51000000 on timeline 1

Jeff Ross

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2013-11-07 18:24:20 Re: problem with partitioned table and indexed json field
Previous Message Tom Lane 2013-11-07 15:51:46 Re: problem with partitioned table and indexed json field