Re: Support for N synchronous standby servers - take 2

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Beena Emerson <memissemerson(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for N synchronous standby servers - take 2
Date: 2015-06-29 17:40:56
Message-ID: 55918328.5010603@agliodbs.com
Lists: pgsql-hackers

On 06/29/2015 01:01 AM, Michael Paquier wrote:
> On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

>> Right. Well, another reason we should be using a system catalog and not
>> a single GUC ...
>
> I assume that this takes into account the fact that you will still
> need a SIGHUP to reload properly the new node information from those
> catalogs and to track if some information has been modified or not.

Well, my hope was NOT to need a sighup, which is something I see as a
failing of the current system.

> And the fact that a connection to those catalogs will be needed as
> well, something that we don't have now.

Hmmm? I was envisioning the catalog being used on the master.
Why would we need an additional connection for that? Don't we already need
a connection in order to update pg_stat_replication?

> Another barrier to the catalog
> approach is that catalogs get replicated to the standbys, and I think
> that we want to avoid that.

Yeah, it occurred to me that that approach has its downsides as well as
its upsides. For example, you wouldn't want a failed-over new master to
sync-rep to itself. Mostly, I was looking for something reactive,
relational, and validated, instead of passing an unvalidated string into
postgresql.conf and hoping that it's accepted on reload. Also, some kind
of catalog approach would permit incremental changes to the config instead
of wholesale replacement.
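
Just to make "reactive, relational, and validated" concrete, here is a
purely hypothetical sketch (nothing like this exists today, and the table
and column names are made up) of how a relational representation would let
constraints do the validation and make changes incremental:

-- Hypothetical sketch only; the table and columns are invented for
-- illustration.  Constraints validate the config up front, and a change
-- is a single DML statement instead of rewriting a whole GUC string.
CREATE TABLE sync_standby_config (
    set_name  text NOT NULL,                 -- e.g. 'dc1'
    standby   text NOT NULL,                 -- matches application_name
    priority  int  NOT NULL CHECK (priority > 0),
    PRIMARY KEY (set_name, standby)
);

-- Adding one standby is an incremental change, not wholesale replacement:
INSERT INTO sync_standby_config VALUES ('dc1', 'rep3', 2);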

> But perhaps you simply meant having an SQL
> interface with some metadata, right? Perhaps I got confused by the
> word 'catalog'.

No, that doesn't make any sense.

>>>> I'm personally not convinced that quorum and prioritization are
>>>> compatible. I suggest instead that quorum and prioritization should be
>>>> exclusive alternatives, that is that a synch set should be either a
>>>> quorum set (with all members as equals) or a prioritization set (if rep1
>>>> fails, try rep2). I can imagine use cases for either mode, but not one
>>>> which would involve doing both together.
>>>>
>>>
>>> Yep, separating the GUC parameter between prioritization and quorum
>>> could be also good idea.
>>
>> We're agreed, then ...
>
> Er, I disagree here. Being able to get prioritization and quorum
> working together is a requirement of this feature in my opinion. Using
> again the example above with 2 data centers, being able to define a
> prioritization set on the set of nodes of data center 1, and a quorum
> set in data center 2 would reduce failure probability by being able to
> prevent problems where for example one or more nodes lag behind
> (improving performance at the same time).

Well, then *someone* needs to define the desired behavior for all
permutations of prioritized synch sets. If it's undefined, then we're
far worse off than we are now.
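
To illustrate the permutation problem, a combined spec might
hypothetically look something like this (the keywords and nesting are
invented purely for the sake of argument; today's server has no such
syntax), and the semantics of every such combination would need to be
spelled out before a DBA could administer it:

-- Invented, illustrative syntax only; today's server does not
-- understand it.  Intent: wait for dc1a (falling back to dc1b by
-- priority) AND for any 2 of the 3 dc2 nodes (a quorum).
ALTER SYSTEM SET synchronous_standby_names =
    'priority(dc1a, dc1b) AND quorum(2 OF dc2a, dc2b, dc2c)';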

>>> Also I think that we must enable us to decide which server we should
>>> promote when the master server is down.
>>
>> Yes, and probably my biggest issue with this patch is that it makes
>> deciding which server to fail over to *more* difficult (by adding more
>> synchronous options) without giving the DBA any more tools to decide how
>> to fail over. Aside from "because we said we'd eventually do it", what
>> real-world problem are we solving with this patch?
>
> Hm. This patch needs to be coupled with improvements to
> pg_stat_replication to be able to represent a node tree by basically
> adding to which group a node is assigned. I can draft that if needed,
> I am just a bit too lazy now...
>
> Honestly, this is not a matter of tooling. Even today if a DBA wants
> to change s_s_names without touching postgresql.conf you could just
> run ALTER SYSTEM and then reload parameters.

You're confusing two separate things. The primary manageability problem
has nothing to do with altering the parameter. The main problem is: if
there is more than one synch candidate, how do we determine *after the
master dies* which candidate replica was in synch at the time of
failure? Currently there is no way to do that. This proposal plans to,
effectively, add more synch candidate configurations without addressing
that core design failure *at all*. That's why I say that this patch
decreases overall reliability of the system instead of increasing it.

When I set up synch rep today, I never use more than two candidate synch
servers because of that very problem. And even with two, I have to check
the replay position because I have no way to tell which replica was in
sync at the time of failure. Even with the current, limited feature, this
significantly reduces the utility of synch rep. In your proposal, where
I could have multiple synch rep groups in multiple geos, how on Earth
could I figure out what to do when the master datacenter dies?
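
For what it's worth, the manual check I'm describing amounts to running
something like this on every surviving standby and promoting whichever
one is furthest ahead (a sketch using the current 9.4-era function names):

-- Run on each surviving standby once the master is gone.  Whichever
-- standby reports the highest locations has the most WAL and is the
-- least-bad promotion candidate.  Nothing here tells you which node
-- was actually the synchronous standby at the moment of failure.
SELECT pg_last_xlog_receive_location() AS received,
       pg_last_xlog_replay_location()  AS replayed;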

BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC
(assuming it's one parameter) instead of some custom syntax. If it's
JSON, we can validate it in psql, whereas with some custom syntax we have
to wait for the reload to fail before discovering that we forgot a comma.
Using JSON would also permit us to use jsonb_set and jsonb_delete to
change the configuration incrementally.
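
For instance, assuming a purely hypothetical JSON layout for the value
(the keys below are made up), validation and an incremental edit are just
ordinary jsonb operations:

-- Hypothetical JSON shape for the parameter value; the keys are invented.
-- The cast to jsonb is the validation step: a typo fails here, in psql,
-- rather than at reload time.
SELECT '{"sync_sets": [{"name": "dc1", "quorum": 2, "members": ["rep1", "rep2", "rep3"]}]}'::jsonb;

-- Incremental change: raise dc1's quorum from 2 to 3 without retyping
-- the rest of the value (jsonb_set is new in 9.5).
SELECT jsonb_set(
    '{"sync_sets": [{"name": "dc1", "quorum": 2, "members": ["rep1", "rep2", "rep3"]}]}'::jsonb,
    '{sync_sets,0,quorum}',
    '3');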

Question: what happens *today* if we have two different synch rep
strings in two different *.conf files? I wouldn't assume that anyone
has tested this ...
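
One way to at least see which definition won after the fact is
pg_settings, though that says nothing about what the reload itself will do:

-- Shows which file and line the active value actually came from,
-- e.g. postgresql.conf vs. postgresql.auto.conf.
SELECT name, setting, sourcefile, sourceline
FROM pg_settings
WHERE name = 'synchronous_standby_names';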

>> It's always been a problem that one can accomplish a de-facto
>> denial-of-service by joining a cluster using the same application_name
>> as the synch standby, moreso because it's far too easy to do that
>> accidentally. One needs to simply make the mistake of copying
>> recovery.conf from the synch replica instead of the async replica, and
>> you've created a reliability problem.
>
> That's a scripting problem then. There are many ways to do a false
> manipulation in this area when setting up a standby. application_name
> value is one, you can do worse by pointing to an incorrect IP as well,
> miss a firewall filter or point to an incorrect port.

You're missing the point. We've created something unmanageable because
we piggy-backed it onto features intended for something else entirely.
Now you're proposing to piggy-back additional features on top of the
teetering Beijing-acrobat stack of piggy-backs we already have.
I'm saying that if you want synch rep to be a sophisticated,
high-availability system, you need it to actually be high-availability,
not just to pile on additional configuration options.

I'm in favor of a more robust and sophisticated synch rep. But not if
nobody outside this mailing list can configure it, and not if even we
don't know what it will do in an actual failure situation.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
