Re: Support for N synchronous standby servers - take 2

From: Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2015-07-01 14:58:27
Message-ID: CAD21AoBKAPZ1QNvhjfjBxqi56VrEYqgxrSS94jvU9x=U3BdotA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 06/29/2015 01:01 AM, Michael Paquier wrote:
>
> You're confusing two separate things. The primary manageability problem
> has nothing to do with altering the parameter. The main problem is: if
> there is more than one synch candidate, how do we determine *after the
> master dies* which candidate replica was in synch at the time of
> failure? Currently there is no way to do that. This proposal plans to,
> effectively, add more synch candidate configurations without addressing
> that core design failure *at all*. That's why I say that this patch
> decreases overall reliability of the system instead of increasing it.
>
> When I set up synch rep today, I never use more than two candidate synch
> servers because of that very problem. And even with two I have to check
> replay point because I have no way to tell which replica was in-sync at
> the time of failure. Even in the current limited feature, this
> significantly reduces the utility of synch rep. In your proposal, where
> I could have multiple synch rep groups in multiple geos, how on Earth
> could I figure out what to do when the master datacenter dies?

We can already give several servers the same application_name today,
which works like a group.
So there are two problems regarding fail-over:
1. How can we know which group (set) we should use? (Here "group" means
servers sharing an application_name.)
2. How can we decide which server of that group we should promote to be
the next master?

#1 is one of the big problems, I think.
I haven't come up with a complete solution yet, but we would need to know
which server (or group) is the best one to promote without the old master
server running.
For example, by improving the pg_stat_replication view, or by having a
mediation process that continuously checks the progress of each standby.
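
As a rough sketch of that mediation idea (using the 9.4-era function
names, run against each surviving standby; which column to compare is of
course up for discussion):

    -- run on each surviving standby; the furthest replay_lsn is the
    -- best promotion candidate
    SELECT pg_last_xlog_replay_location() AS replay_lsn,
           pg_is_in_recovery()            AS is_standby;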

For #2, I guess the best solution is that the DBA can promote any server
of the group. That is, the DBA can always promote a server without
considering the state of the other servers in that group.
It's not difficult if we always use the lowest LSN of a group as the
group LSN.
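
For illustration (again 9.4-era column names, and assuming flush is what
sync rep waited on), the group LSN in that sense can be read off
pg_stat_replication while the master is still up:

    -- lowest flushed LSN per application_name group = the group LSN
    SELECT DISTINCT ON (application_name)
           application_name,
           flush_location AS group_lsn
    FROM pg_stat_replication
    ORDER BY application_name, flush_location;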

>
> BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
> (assuming it's one parameter) instead of some custom syntax. If it's
> JSON, we can validate it in psql, whereas if it's some custom syntax we
> have to wait for the db to reload and fail to figure out that we forgot
> a comma. Using JSON would also permit us to use jsonb_set and
> jsonb_delete to incrementally change the configuration.

That sounds convenient and flexible. I agree with this JSON-format
parameter only if we don't combine both quorum and prioritization,
because of backward compatibility.
I lean toward a JSON-format value in a new, separate GUC parameter.
Anyway, if we use JSON, I'm imagining parameter values like the following:
{
    "group1" : {
        "quorum" : 1,
        "standbys" : [
            {
                "a" : {
                    "quorum" : 2,
                    "standbys" : [ "c", "d" ]
                }
            },
            "b"
        ]
    }
}
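
On the validation point above, a simple cast in psql is enough to catch
a malformed value (a missing comma, say) before any reload:

    SELECT '{"group1": {"quorum": 1,
              "standbys": [{"a": {"quorum": 2, "standbys": ["c", "d"]}},
                           "b"]}}'::jsonb;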

> Question: what happens *today* if we have two different synch rep
> strings in two different *.conf files? I wouldn't assume that anyone
> has tested this ...

We use the last-defined parameter even if the sync rep string appears in
several files, right?
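
A quick way to check which definition actually took effect (sourcefile
and sourceline are shown to superusers):

    SELECT name, setting, sourcefile, sourceline
    FROM pg_settings
    WHERE name = 'synchronous_standby_names';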

Regards,

--
Sawada Masahiko
