Re: pg_upgrade + replica servers + rsync --size-only is unsafe

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Rob Emery <re-pgsql(at)codeweavers(dot)net>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: pg_upgrade + replica servers + rsync --size-only is unsafe
Date: 2020-12-02 17:28:05
Message-ID: 20201202172805.GV16415@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Greetings,

* Rob Emery (re-pgsql(at)codeweavers(dot)net) wrote:
> We're pretty sure that we've just hit a scenario where the instructions for
> pg_upgrade with standby servers aren't quite safe.

I've pretty much always felt those instructions should come with a
"expert use only" note.

> We were testing our upgrade process with a copy of our live cluster in a
> lab environment and we found that using
>
> rsync --verbose --archive --delete --hard-links --size-only
> --no-inc-recursive /media/postgresql/data/main /media/postgresql/data/9.6
> PGRETESTA02:/media/postgresql/data
>
> Resulted in:
>
> 2020-12-02 14:49:11.513 GMT [20884-1] LOG: database system was shut
> down in recovery at 2020-12-02 13:57:56 GMT
> 2020-12-02 14:49:11.513 GMT [20884-2] LOG: entering standby mode
> 2020-12-02 14:49:11.557 GMT [20884-3] LOG: consistent recovery state
> reached at 19E/25000098
> 2020-12-02 14:49:11.557 GMT [20884-4] LOG: invalid record length at
> 19E/25000098: wanted 24, got 0
> 2020-12-02 14:49:11.559 GMT [20883-1] LOG: database system is ready
> to accept read only connections
> 2020-12-02 14:49:11.593 GMT [20888-1] FATAL: database system
> identifier differs between the primary and standby
> 2020-12-02 14:49:11.593 GMT [20888-2] DETAIL: The primary's
> identifier is 6901669428825624285, the standby's identifier is
> 690161835164
> 1138930.
>
> however if we don't use --size-only, then the process worked fine and we a
> smooth upgrade.

This doesn't look like a --size-only issue though- how did you end up
with this when the pg_controldata file (where the system identifier is
pulled from) shouldn't even exist and won't be a hardlink to the
existing one on the old system since it's a new cluster, and therefore
should definitely be copied?

When you go to run the rsync to get the replica up to date you should
have a system that looks like this:

PRIMARY:

/srv/old_cluster
... lots of files
/srv/new_cluster
... lots of *new* files, including pg_controldata
... hardlinks to PG table/index files that aren't part of the catalog

REPLICA:

/srv/old_cluster
... lots of files
/srv/new_cluster
... entirely empty

Then you run the rsync, at the /srv level, and all those 'new files' in
/srv/new_cluster on the primary should get copied over to
/srv/new_cluster on the replica, while any files which are hard-linked
between old_cluster and new_cluster should end up as hard links on the
replica.

Thanks,

Stephen

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message John Scalia 2020-12-02 18:52:18 Seeing privileges on a schema
Previous Message Rob Emery 2020-12-02 16:18:56 pg_upgrade + replica servers + rsync --size-only is unsafe