Re: Issues with Quorum Commit

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 21:19:00
Message-ID: 1286313540.2025.2923.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-10-05 at 13:45 -0700, Jeff Davis wrote:
> On Tue, 2010-10-05 at 12:11 -0700, Josh Berkus wrote:
> > B. Eventual Inconsistency
> > -------------------------
> > If we have a quorum commit, it's possible for any individual standby to
> > be indefinitely ahead of any standby which is not needed by the quorum.
> > This means that:
> >
> > -- There is no clear criteria for when a standby which is not needed for
> > quorum should be considered no longer a synch standby, and
> > -- Applications cannot make assumptions that synch rep promises some
> > specific window of synchronicity, eliminating a lot of the value of
> > quorum commit.
>
> Point B seems particularly dangerous.
>
> When you lose one of the systems and the lagging server becomes required
> for quorum, then all of a sudden you could be facing a huge delay to
> commit the next transaction (because it needs to catch up on a lot of
> WAL replay). This can happen even without a network problem at all, and
> seems very likely to result in the lagging system being considered
> "down" due to a timeout. Not good, because the reason it is required for
> quorum is because another standby just went down.
>
> In other words, a lagging standby combined with a timeout mechanism is
> essentially useless, because it will never catch up in time to be a part
> of the quorum.

Thanks for explaining what was meant.

This issue is a serious problem for the "apply to *all* servers" setup
that Heikki has been describing as a useful use case. We register a
standby, it goes down, and we decide to wait for it. Then, when it does
come back up, it takes ages to catch up.

This is really the nail in the coffin for the "All" servers use case,
and a significant blow to the requirement for standby registration.

If we use N+1 redundancy as I have explained, then this situation does
not occur until you have fewer than N standbys available. But then it's
no surprise that RAID-5 won't work with 4 drives either.
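
To make the comparison concrete, here is a rough toy model, not anything
taken from the PostgreSQL source. It assumes a commit returns once
`needed` standbys have acknowledged it, and that a standby that is down
never acknowledges. The function name `commit_wait` and all the delay
figures are invented for illustration only.

```python
import math

def commit_wait(ack_delays, needed):
    """Seconds until `needed` standbys have acknowledged a commit.

    ack_delays holds one entry per registered standby (hypothetical
    figures); math.inf marks a standby that is down and never replies.
    """
    if needed <= 0:
        return 0.0
    if needed > len(ack_delays):
        return math.inf
    # The commit is released by the `needed`-th fastest acknowledgement.
    return sorted(ack_delays)[needed - 1]

N = 2                                      # acknowledgements required
registered = [0.01, 0.02, 0.03]            # N + 1 healthy standbys
one_down = [0.01, math.inf, 0.03]          # one standby lost
two_down = [0.01, math.inf, math.inf]      # fewer than N left

print(commit_wait(registered, N))          # 0.02
print(commit_wait(one_down, N))            # 0.03, still commits
print(commit_wait(two_down, N))            # inf, now we are waiting
print(commit_wait(one_down, len(one_down)))  # inf, "all" blocks at once
```

Under these assumptions the "all" configuration stalls as soon as any
one registered standby is down, while N+1 only stalls once fewer than N
remain. If the surviving standby were lagging rather than healthy, its
delay would simply be a large number instead of 0.03, which is the case
Jeff describes.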

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
