Re: 'replication checkpoint has wrong magic' on the newly cloned replicas

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Alex Kliukin <oleksii(at)fastmail(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: 'replication checkpoint has wrong magic' on the newly cloned replicas
Date: 2017-11-29 18:44:56
Message-ID: CAOuzzgpDMuXZiMY4h0wFiiaZDxv9=Bw31G0YHV7PFQXLOYi1Jw@mail.gmail.com
Lists: pgsql-admin

Greetings,

On Wed, Nov 29, 2017 at 13:33 Alex Kliukin <oleksii(at)fastmail(dot)com> wrote:

>
> On 29. Nov 2017, at 18:52, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>
> Greetings,
>
> On Wed, Nov 29, 2017 at 12:41 Oleksii Kliukin <oleksii(at)fastmail(dot)com>
> wrote:
>
>> Hi Stephen,
>>
>> > On 29. Nov 2017, at 15:54, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>> >
>> > Greetings,
>> >
>> > * Alex Kliukin (alexk(at)hintbits(dot)com) wrote:
>> >> The cloning itself is done by copying a compressed image via ssh,
>> >> running the following command from the replica:
>> >>
>> >> """ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
>> >> --exclude "recovery.done" \
>> >> --exclude "pacemaker_instanz" \
>> >> --exclude "dont_start" \
>> >> --exclude "pg_log" \
>> >> --exclude "pg_xlog" \
>> >> --exclude "postmaster.pid" \
>> >> --exclude "recovery.done" \
>> >> * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C
>> >> {slave_datadir}""
>> >>
>> >> WAL archiving starts before the copy starts, as the script that clones
>> >> the replica checks that WAL archiving is running before cloning.
>> >
>> > Maybe you're doing it and haven't mentioned it, but you have to use
>> > pg_start_backup/pg_stop_backup.
>>
>> Sorry for not mentioning it, as it seemed obvious, but we are calling
>> pg_start_backup and pg_stop_backup at the right time.
>
>
> Ah, not something I can assume, heh.
>
> Then it depends on which version of PG and if you’re able to run
> start/stop on the replica or not. If you can’t run it on the replica and
> have to run it on the primary (prior to 9.6) then you need to make sure to
> wait for things to happen on the primary and for that to be replicated
> before you can start.
>
>
> We are using exclusive backups from the master. First, the script checks
> that WAL files are shipped to the NFS, where the replica expects to find
> them (we check the md5 checksum of the file in order to make sure that the
> NFS actually delivers the file that the master has archived). Then
> pg_start_backup runs on the master and its status is checked. On success,
> the copy command runs. When the copy command finishes, pg_stop_backup is
> executed. Once pg_stop_backup finishes successfully, replica configuration
> files (postgresql.conf, pg_hba.conf, pg_ident.conf) are linked from their
> location in the repository and the replica is started.
>

No, you must wait until the replica has moved forward far enough, and you
have to copy the backup_label file from the primary as well. Otherwise PG
won’t realize you’re doing a backup-based recovery.

> This is a fairly typical procedure, which, I believe, is also well
> described in the docs.
>

Please provide a link to where that is because if that’s the case then we
need to correct it or remove it. This is absolutely not safe without
additional checks being done and various other magic happening (like
copying the backup_label off the primary where it’s created).
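
As a rough illustration of the kind of check meant here, a clone script
could refuse to start the replica unless backup_label made it into the
copied data directory. This is only a sketch: the function name is made up
for this example, and the regular expression assumes the usual first line
of backup_label ("START WAL LOCATION: ... (file ...)"). It does not replace
the other checks a safe backup tool performs.

```python
import os
import re

# Assumption: the first line of backup_label looks like
# "START WAL LOCATION: 0/2000028 (file 000000010000000000000002)".
START_RE = re.compile(
    r"^START WAL LOCATION: ([0-9A-F]+/[0-9A-F]+) \(file ([0-9A-F]{24})\)"
)


def read_backup_start(datadir):
    """Return (start_lsn, start_wal_file) from datadir/backup_label.

    Raises RuntimeError when the file is missing or cannot be parsed,
    so the caller can abort instead of starting the replica.
    """
    path = os.path.join(datadir, "backup_label")
    if not os.path.isfile(path):
        raise RuntimeError("no backup_label in %s; not starting replica" % datadir)
    with open(path) as f:
        first_line = f.readline()
    match = START_RE.match(first_line)
    if match is None:
        raise RuntimeError("unexpected backup_label contents: %r" % first_line)
    return match.group(1), match.group(2)
```

The returned WAL file name is the earliest segment the replica must be able
to fetch from the archive, so the script could also verify that this segment
is present on the NFS share before starting PG.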

> If you’re on 9.6 and using non-exclusive backup, you need to be sure to
> capture the contents of the stop backup and write it into backup_label
> before you start the system back up.
>
>
> We don’t use non-exclusive backups at all.
>

All the more likely that your procedure is causing more corruption than you
realize then.

Seriously, again, this is not easy to get right, especially when you’re
doing things that weren’t explicitly documented and supported. Using
existing tools, written by people who understand why the processes involved
are safe and who have written lots of tests to verify that, is really the
recommendation that you should take away from this.

At least with 9.6 there’s proper documentation on how to run a
non-exclusive backup on a replica properly and if you very carefully follow
the procedure then you may get it right, but you will still want to test
extensively.
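
For the non-exclusive case on 9.6, the step of writing out what
pg_stop_backup(false) returns might look roughly like the sketch below. It
assumes the caller has already run "SELECT labelfile, spcmapfile FROM
pg_stop_backup(false)" on the same connection that ran pg_start_backup and
passes the two text values in. The function name and arguments are
hypothetical, and this covers only the file-writing step, not the rest of
the procedure.

```python
import os


def write_backup_files(backup_dir, labelfile, spcmapfile=""):
    """Write the pg_stop_backup(false) results into the backup directory.

    labelfile goes to backup_label; spcmapfile goes to tablespace_map,
    but only when it is non-empty. Contents are written unmodified.
    Returns the list of paths that were written.
    """
    written = []
    label_path = os.path.join(backup_dir, "backup_label")
    # newline="" disables newline translation so the text is stored as-is.
    with open(label_path, "w", newline="") as f:
        f.write(labelfile)
    written.append(label_path)
    if spcmapfile:
        map_path = os.path.join(backup_dir, "tablespace_map")
        with open(map_path, "w", newline="") as f:
            f.write(spcmapfile)
        written.append(map_path)
    return written
```

Both files have to be in place in the root of the copied data directory
before PG is started on it.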

Thanks!

Stephen
