Re: initial sync of multiple streaming slaves simultaneously

From: Mike Roest <mike(dot)roest(at)replicon(dot)com>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: initial sync of multiple streaming slaves simultaneously
Date: 2012-09-19 19:51:48
Message-ID: CAE7Byhi6=FqfaB8+4Q_xjWCUr9zEXrr95QVY48PL8X7qtRGO_g@mail.gmail.com
Lists: pgsql-general

Performance.

On our production DB the fast-archiver transfers the datadir in about half
as much time as basebackup.

And this happens on every failover, since clearing the datadir and
resyncing as if from scratch also takes about half the time of an rsync of
an existing datadir, so the transfer time matters.

--Mike

On Wed, Sep 19, 2012 at 1:34 PM, Lonni J Friedman <netllama(at)gmail(dot)com> wrote:

> Just curious, is there a reason why you can't use pg_basebackup ?
>
> On Wed, Sep 19, 2012 at 12:27 PM, Mike Roest <mike(dot)roest(at)replicon(dot)com>
> wrote:
> >
> >> Is there any hidden issue with this that we haven't seen? Or does anyone
> >> have suggestions as to an alternate procedure that will allow 2 slaves to
> >> sync concurrently?
> >>
> > With some more testing I've done today I seem to have found an issue with
> > this procedure.
> > When the slave starts up after the sync, it reaches what it thinks is a
> > consistent recovery point very quickly, based on the pg_stop_backup.
> >
> > eg:
> > (from the recover script)
> > 2012-09-19 12:15:02: pgsql_start start
> > 2012-09-19 12:15:31: pg_start_backup
> > 2012-09-19 12:15:31: -----------------
> > 2012-09-19 12:15:31: 61/30000020
> > 2012-09-19 12:15:31: (1 row)
> > 2012-09-19 12:15:31:
> > 2012-09-19 12:15:32: NOTICE: pg_stop_backup complete, all required WAL
> > segments have been archived
> > 2012-09-19 12:15:32: pg_stop_backup
> > 2012-09-19 12:15:32: ----------------
> > 2012-09-19 12:15:32: 61/300000D8
> > 2012-09-19 12:15:32: (1 row)
> > 2012-09-19 12:15:32:
> >
> > While the sync was running (but after the pg_stop_backup) I pushed a bunch
> > of traffic against the master server, which got me to a current xlog
> > location of
> > postgres=# select pg_current_xlog_location();
> > pg_current_xlog_location
> > --------------------------
> > 61/6834C450
> > (1 row)
> >
> > The startup of the slave after the sync completed:
> > 2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG: database system was
> > interrupted; last known up at 2012-09-19 12:15:31 MDT
> > 2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG: creating missing WAL
> > directory "pg_xlog/archive_status"
> > 2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG: entering standby mode
> > 2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG: streaming replication
> > successfully connected to primary
> > 2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG: redo starts at 61/30000020
> > 2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG: consistent recovery state
> > reached at 61/31000000
> > 2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG: database system is ready to
> > accept read only connections
> >
> > It shows the DB reached a consistent state as of 61/31000000, which is well
> > behind the current location of the master (and the data files that were
> > synced over to the slave). Monitoring the server showed the expected
> > slave delay, which disappeared as the slave pulled and recovered from the WAL
> > files that got generated after the pg_stop_backup.
> >
> > But based on this it looks like this procedure would end up with an
> > indeterminate amount of time (based on how much traffic the master processed
> > while the slave was syncing) during which the slave couldn't be trusted for
> > failover or querying, as the server is up and running but is not actually in a
> > consistent state.
> >
> > Thinking it through, the more complicated script version of the 2 server
> > recovery (where first past the post runs start_backup or stop_backup)
> > would also have this issue (although our failover slave would always be the
> > one running stop_backup, as it syncs faster, so at least it would always be
> > consistent, but the DR would still have the problem).
>
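
To put a number on how far the "consistent" point in the quoted log trails the master, the two xlog locations can simply be subtracted. What follows is only a rough Python sketch and not part of the procedure described above: the helper names (parse_xlog_location, xlog_bytes_between, standby_has_caught_up) are made up for illustration, and it assumes a location is a plain 64-bit byte position written as two hex halves. As far as I know, releases before 9.3 skip the last 16MB segment of each 4GB logical xlog file, so a difference spanning different high halves could be slightly overstated there; the two values compared here share the same high half (61), so that caveat should not apply.

```python
def parse_xlog_location(text):
    """Turn an 'XXX/YYYYYYYY' xlog location string into one integer byte position.

    Assumption: the value is treated as a flat 64-bit position (high half << 32 | low half).
    """
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def xlog_bytes_between(start, end):
    """Bytes of WAL between two locations (negative if end is behind start)."""
    return parse_xlog_location(end) - parse_xlog_location(start)


def standby_has_caught_up(replay_location, master_location_after_sync):
    """True once the standby has replayed at least up to the recorded master location."""
    return parse_xlog_location(replay_location) >= parse_xlog_location(
        master_location_after_sync
    )


if __name__ == "__main__":
    consistent_at = "61/31000000"  # "consistent recovery state reached at" from the log
    master_at = "61/6834C450"      # pg_current_xlog_location() on the master

    gap = xlog_bytes_between(consistent_at, master_at)
    print("WAL still to replay: %d bytes (%.1f MiB)" % (gap, gap / (1024.0 * 1024.0)))
    print("caught up:", standby_has_caught_up(consistent_at, master_at))
```

One way this might be used: the recover script could record pg_current_xlog_location() on the master once the file copy finishes, and only treat the slave as usable for failover or querying after its replay location (for example from pg_last_xlog_replay_location(), assuming that function is available on your version) has passed the recorded value.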
