| From: | Jeff Ross <jross(at)wykids(dot)org> | 
|---|---|
| To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> | 
| Cc: | PostgreSQL <pgsql-general(at)postgresql(dot)org> | 
| Subject: | Re: After upgrade to 9.3, streaming replication fails to start--SOLVED | 
| Date: | 2013-11-07 16:29:23 | 
| Message-ID: | 527BBFE3.3050704@wykids.org | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-general | 
On 11/6/13, 12:26 PM, Jeff Ross wrote:
>
> On 11/6/13, 11:32 AM, Jeff Janes wrote:
>> On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross(at)wykids(dot)org 
>> <mailto:jross(at)wykids(dot)org>> wrote:
>>
>>
>>     _postgresql(at)nirvana:/var/postgresql $ cat start_hot_standby.sh
>>     #!/bin/sh
>>     backup_label=wykids_`date +%Y-%m-%d`
>>     #remove any existing wal files on the standby
>>     ssh dukkha.internal rm -rf /wal/*
>>     #stop the standby server if it is running
>>     ssh dukkha.internal sudo /usr/local/bin/svc -d
>>     /service/postgresql.5432
>>     psql -c "select pg_start_backup('$backup_label');" template1
>>     rsync \
>>             --copy-links \
>>             --delete \
>>             --exclude=backup_label \
>>
>>
>>
>> Excluding backup_label is exactly the wrong thing to do.  The only 
>> reason backup_label is created in the first place is so that it can 
>> be copied to the replica, where it is needed.  It's existence on the 
>> master is a nuisance.
>>
>>
>>         --exclude=postgresql.conf \
>>             --exclude=recovery.done \
>>             -e ssh -avz /var/postgresql/data.93.5432/ \
>>             dukkha.internal:/var/postgresql/data.93.5432/
>>     ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
>>     ssh dukkha.internal rm -f
>>     /var/postgresql/data.93.5432/pg_xlog/archive_status/*
>>     ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
>>     ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
>>     ssh dukkha.internal ln -s /var/postgresql/recovery.conf
>>     /var/postgresql/data.93.5432/recovery.conf
>>     psql -c "select pg_stop_backup();" template1
>>     ssh dukkha.internal sudo /usr/local/bin/svc -u
>>     /service/postgresql.5432
>>
>>
>>     _postgresql(at)nirvana:/var/postgresql $ sh -x start_hot_standby.sh
>>     + date +%Y-%m-%d
>>     + backup_label=wykids_2013-11-06
>>     + ssh dukkha.internal rm -rf /wal/*
>>     + ssh dukkha.internal sudo /usr/local/bin/svc -d
>>     /service/postgresql.5432
>>     + rsync -e ssh /wal/ dukkha.internal:/wal/
>>     skipping directory .
>>
>>
>>
>> Where is the above rsync coming from?  It doesn't seem to be in the 
>> shell script you showed.
>>
>> Anyway, I think you need to copy the wal over after you call 
>> pg_stop_backup, not before you call pg_start_backup.
>>
>> Cheers,
>>
>> Jeff
>
> Hi Jeff,
>
> Thanks for the reply.  Oops, I copied one of the many changes to the 
> script, but not the one with the rsync to copy /wal from the primary 
> to the standby.
>
> I should have mentioned that wal archiving is setup and working from 
> the primary to the standby.  It saves wal both on the locally on the 
> primary and remotesly on the standby.
>
> I moved the rsync line to copy wal from primary to secondary after 
> pg_stop_backup but I'm still getting the same panic on the standby.
>
> Here's the real, honest version of the script I use to start the hot 
> standby:
>
> _postgresql(at)nirvana:/var/postgresql $ cat start_hot_standby.sh
> #!/bin/sh
> backup_label=wykids_`date +%Y-%m-%d`
> #remove any existing wal files on the secondary
> ssh dukkha.internal "rm -rf /wal/*"
> ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
> psql -c "select pg_start_backup('$backup_label');" template1
> rsync \
>         --copy-links \
>         --delete \
>         --exclude=backup_label \
>         --exclude=postgresql.conf \
>         --exclude=recovery.done \
>         -e ssh -avz /var/postgresql/data.93.5432/ \
>         dukkha.internal:/var/postgresql/data.93.5432/
> ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"
> ssh dukkha.internal "rm -f 
> /var/postgresql/data.93.5432/pg_xlog/archive_status/*"
> ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"
> ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"
> ssh dukkha.internal "ln -s /var/postgresql/recovery.conf 
> /var/postgresql/data.93.5432/recovery.conf"
> psql -c "select pg_stop_backup();" template1
> rsync -e ssh -avz /wal/ dukkha.internal:/wal/
> ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432
>
> Here are the logs on the standby after running the above:
>
> 2013-11-06 11:56:30.792461500 <%> LOG:  database system was 
> interrupted; last known up at 2013-11-06 11:52:22 MST
> 2013-11-06 11:56:30.800685500 <%> LOG:  entering standby mode
> 2013-11-06 11:56:30.800891500 <%> LOG:  invalid primary checkpoint record
> 2013-11-06 11:56:30.800930500 <%> LOG:  invalid secondary checkpoint 
> record
> 2013-11-06 11:56:30.801004500 <%> PANIC:  could not locate a valid 
> checkpoint record
>
> Jeff
My apologies to Jeff--I'd missed his in-line comment above that I should 
*not* exclude the backup label from the rsync of the primary to the 
standby.  As soon as I removed that exclusion and with his other 
suggested change that I should copy the /wal from the primary to the 
standby after pg_stop_backup, streaming replication started on the 
standby exactly as it should.
Logs from the standby:
2013-11-07 09:21:15.273712500 <%> LOG:  database system was interrupted; 
last known up at 2013-11-07 09:16:05 MST
2013-11-07 09:21:15.286834500 <%> LOG:  entering standby mode
2013-11-07 09:21:16.873654500 <%> LOG:  restored log file 
"000000010000000000000050" from archive
2013-11-07 09:21:16.936355500 <%> LOG:  redo starts at 0/50000024
2013-11-07 09:21:17.129718500 <%> LOG:  consistent recovery state 
reached at 0/50036D6C
2013-11-07 09:21:17.131933500 <%> LOG:  database system is ready to 
accept read only connections
2013-11-07 09:21:17.136856500 cp: /wal/000000010000000000000051: No such 
file or directory
2013-11-07 09:21:17.194811500 <%> LOG:  started streaming WAL from 
primary at 0/51000000 on timeline 1
Jeff Ross
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2013-11-07 18:24:20 | Re: problem with partitioned table and indexed json field | 
| Previous Message | Tom Lane | 2013-11-07 15:51:46 | Re: problem with partitioned table and indexed json field |