Re: Issues with Quorum Commit

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Issues with Quorum Commit
Date:	2010-10-06 01:52:10
Message-ID:	1286329930.28453.72.camel@jdavis-ux.asterdata.local
Lists:	pgsql-hackers

On Tue, 2010-10-05 at 22:19 +0100, Simon Riggs wrote:
> > In other words, a lagging standby combined with a timeout mechanism is
> > essentially useless, because it will never catch up in time to be a part
> > of the quorum.
>
> Thanks for explaining what was meant.
>
> This issue is a serious problem with the apply to *all* servers that
> Heikki has been describing as being a useful use case. We register a
> standby, it goes down and we decide to wait for it. Then when it does
> come back up it takes ages to catch up.
>
> This is really the nail in the coffin for the "All" servers use case,
> and a significant blow to the requirement for standby registration.

I'm not sure I entirely understand. I was concerned about the case of a
standby server being allowed to lag behind the rest by a large number of
WAL records. That can't happen in the "wait for all servers to apply"
case, because the system would become unavailable rather than allow a
significant difference in the amount of WAL applied.

I'm not saying that an unavailable system is good, but I don't see how
my particular complaint applies to the "wait for all servers to apply"
case.

The case I was worried about is:
* 1 master and 2 standbys
* The rule is "wait for at least one standby to apply the WAL"

In your notation, I believe that's M -> { S1, S2 }

In that case, if S1 is just a little faster than S2, then S2 might
build up a significant queue of unapplied WAL. Then, when S1 goes down,
there's no way for the slower one to acknowledge a new transaction
without playing through all of the unapplied WAL.
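A toy simulation may make the queue build-up easier to see. This is only a sketch under simplifying assumptions, not how any proposed patch works: commits are issued one at a time, each standby needs a fixed time to apply one record, and the master releases a commit as soon as `quorum` standbys have applied it. The function name `max_lag` and the per-record costs are made up for illustration.

```python
def max_lag(apply_cost, quorum, commits):
    """Return the largest number of records by which any standby trails the master.

    apply_cost: hypothetical seconds each standby needs to apply one record.
    quorum:     how many standbys must apply a record before the commit returns.
    """
    done_at = {s: 0.0 for s in apply_cost}   # when each standby finishes its latest record
    applied_at = {s: [] for s in apply_cost}  # completion time of every record, per standby
    now = 0.0
    worst = 0
    for i in range(commits):
        # Record i is shipped at `now`; each standby applies records strictly in order.
        for s, cost in apply_cost.items():
            done_at[s] = max(done_at[s], now) + cost
            applied_at[s].append(done_at[s])
        # The commit returns once `quorum` standbys have applied record i.
        now = sorted(done_at.values())[quorum - 1]
        # Lag = records shipped so far minus records this standby has applied by `now`.
        for s in apply_cost:
            applied = sum(1 for t in applied_at[s] if t <= now)
            worst = max(worst, (i + 1) - applied)
    return worst


costs = {"S1": 1.0, "S2": 1.25}  # S1 is only a little faster than S2
lag_wait_for_all = max_lag(costs, quorum=2, commits=100)
lag_wait_for_one = max_lag(costs, quorum=1, commits=100)
```

With these made-up numbers, waiting for both standbys keeps the lag at zero because the master is throttled to the slower one, while waiting for only one lets S2 fall 20 records behind after 100 commits, and the gap keeps growing with the number of commits. The sketch does not model a standby going down.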

Intuitively, the administrator would think that he was getting both HA
and redundancy, but in reality the availability is no better than if
there were only two servers (M -> S1), except that it might be faster to
replay the WAL than to set up a new standby (but that's not guaranteed).
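As a back-of-the-envelope check on how long the surviving standby might need, here is a small sketch. The rates and the one-hour window are invented numbers, the helper names are hypothetical, and it assumes no new WAL arrives during catch-up because commits are blocked while the master waits.

```python
def backlog_after(seconds, wal_rate, apply_rate):
    """Unapplied WAL (arbitrary units) a standby accumulates over `seconds`."""
    return max(0.0, (wal_rate - apply_rate) * seconds)


def replay_time(backlog, apply_rate):
    """Seconds the standby needs to play through its backlog before it can ack."""
    return backlog / apply_rate


# Hypothetical figures: the master generates 100 units/s, S2 applies 95 units/s.
backlog = backlog_after(3600, wal_rate=100.0, apply_rate=95.0)
wait = replay_time(backlog, apply_rate=95.0)
```

Under these assumptions, one hour of S2 being 5% slower leaves 18000 units unapplied, so the first commit after S1 fails would block for a bit over three minutes. Whether that beats setting up a new standby depends on numbers this sketch does not know.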

I think you would call that a misconfiguration, and I would agree. I was
just trying to point out a pitfall that I didn't see until I read Josh's
email.

> If we use N+1 redundancy as I have explained, then this situation does
> not occur until you have less than N standbys available. But then it's
> no surprise that RAID-5 won't work with 4 drives either.

Now I'm more confused. I assume that was a typo (because a RAID-5 does
work with 4 drives), but I think it obscured your point.

Regards,
Jeff Davis
