Re: Sync Rep: First Thoughts on Code

From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep: First Thoughts on Code
Date: 2008-12-11 14:27:12
Message-ID: 20081211142712.GW26596@yugib.highrise.ca
Lists: pgsql-hackers

* Simon Riggs <simon(at)2ndQuadrant(dot)com> [081211 05:45]:
>
> On Wed, 2008-12-10 at 15:06 -0500, Aidan Van Dyk wrote:
>
> > Call me thick, but I'm confused... In sync rep, there *can't be* any
> > catching up to do... i.e. if the "slave" isn't accepting the WAL the
> > master "stops" doing *anything*...
>
> In normal/steady state, yes, you are correct. But there is more...
>
> The simplest way to configure standby would be to freeze the primary
> while we set up the standby and then go straight into normal/steady
> state. That could mean hours of downtime for large databases, which is
> unacceptable in a feature aimed at increasing availability. So we need
> to allow the primary to continue working while the standby is set up.
> That then creates a log gap between the LSN of the primary and the LSN
> of the standby, which must be resolved.
>
> So the catchup occurs during the transient initial phase when standby is
> catching up with primary before they continue together in normal/steady
> state.

But "catchup" *has* to be *done* before PostgreSQL can enter "sync rep".

So, if I start PostgreSQL in sync rep mode without any capable clients
to rep with, it can't do anything... But I'd rather be buggered there
than find out tonight at 3am that it was in sync rep mode but wasn't
really doing sync rep, because I'd messed up something somewhere
(firewall, config, password, anything) and there was no "caught up"
client at the time, and I've just lost a day's worth of my $$$$$
transactions...
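
To make the behaviour I'm asking for concrete, here is a minimal Python
sketch of a "fail closed" commit gate. It is a toy model only, not
PostgreSQL code: Primary, Standby, NoSyncStandbyError and the integer
LSNs are all hypothetical names invented for illustration. The idea is
simply that a commit is refused outright unless at least one standby is
already caught up, instead of silently continuing without replication.

```python
from dataclasses import dataclass, field


class NoSyncStandbyError(RuntimeError):
    """Raised when a commit is attempted with no caught-up standby."""


@dataclass
class Standby:
    name: str
    replayed_lsn: int  # last WAL position this standby has confirmed


@dataclass
class Primary:
    current_lsn: int = 0
    standbys: list = field(default_factory=list)

    def caught_up_standbys(self):
        # A standby counts as caught up only if it has confirmed
        # everything the primary has written so far.
        return [s for s in self.standbys
                if s.replayed_lsn >= self.current_lsn]

    def commit(self, wal_bytes):
        ready = self.caught_up_standbys()
        if not ready:
            # Fail closed: refuse the commit rather than pretend that
            # synchronous replication is happening.
            raise NoSyncStandbyError("no caught-up standby; commit refused")
        self.current_lsn += wal_bytes
        for s in ready:
            # Simulate the synchronous acknowledgement from each standby.
            s.replayed_lsn = self.current_lsn
        return self.current_lsn
```

With this policy a misconfigured firewall or password shows up at once
as refused commits, rather than at 3am as missing transactions.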

> Most of the architectural discussion over last few months has been about
> the need for the initial state and how to handle it. Most of the code
> complexity also.

Well, for me, I'm quite happy with a "restart/stop&start" being a
necessary "downtime" to move to synchronous replication. This way, I
could see a "setup" routine that looks like:
1) Current "production" DB does normal backups/PITR/WAL archiving
2) I set up a new "slave", which involves
- restore from backup + WAL recovery (pg_standby type)
- Could take days+++
- Oh well....
3) Stop production
4) So, now the slave is caught up...
5) Start "production", now in sync rep mode, as master
6) Start the slave in sync rep mode as slave...

So downtime would be limited to the time from the old postmaster
shutdown to the time the slave has replayed the last WAL and connected
to the restarted postmaster as a sync rep slave...

Or am I way too naive to think that a small downtime to "switch" from
non-sync-rep to sync-rep is acceptable...

a.
--
Aidan Van Dyk                                  Create like a god,
aidan(at)highrise(dot)ca                       command like a king,
http://www.highrise.ca/                        work like a slave.
