Re: postgres in HA constellation

From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: postgres in HA constellation
Date: 2006-10-13 19:30:07
Message-ID: 60ac40814w.fsf@dba2.int.libertyrms.com
Lists: pgsql-admin

bnichols(at)ca(dot)afilias(dot)info (Brad Nicholson) writes:
> On Wed, 2006-10-11 at 16:12 -0500, Jim C. Nasby wrote:
>> On Wed, Oct 11, 2006 at 10:28:44AM -0400, Andrew Sullivan wrote:
>> > On Thu, Oct 05, 2006 at 08:43:21PM -0500, Jim Nasby wrote:
>> > > Isn't it entirely possible that if the master gets trashed it would
>> > > start sending garbage to the Slony slave as well?
>> >
>> > Well, maybe, but unlikely. What happens in a shared-disc failover is
>> > that the second machine re-mounts the same partition as the old
>> > machine had open. The risk is the case where your to-be-removed
>> > machine hasn't actually stopped writing on the partition yet, but
>> > your failover software thinks it's dead, and can fail over. Two
>> > processes have the same Postgres data and WAL files mounted at the
>> > same time, and blammo. As nearly as I can tell, it takes
>> > approximately zero time for this arrangement to make such a mess that
>> > you're not committing any transactions. Slony will only get the data
>> > on COMMIT, so the risk is very small.
>>
>> Hrm... I guess it depends on how quickly the Slony master would stop
>> processing if it was talking to a shared-disk that had become corrupt
>> from another postmaster.
>
> That doesn't depend on Slony, it depends on Postgres. If transactions
> are committing on the master, Slony will replicate them. You could have
> a situation where your HA failover trashes some of your database, but the
> database still starts up. It starts accepting and replicating
> transactions before the corruption is discovered.

There's a bit of "joint responsibility" there.

Let's suppose that the disk has gone bad, zeroing out some index pages
for the Slony-I table sl_log_1. (The situation will be the same for
just about any kind of corruption of a Slony-I internal table.)
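
For instance (a sketch only; I'm assuming a cluster named "mycluster",
so the Slony-I schema is "_mycluster", and the usual Slony-I 1.x index
columns on sl_log_1), you can force PostgreSQL to walk the suspect
index rather than seqscan the heap:

    -- Discourage sequential scans so the ORDER BY gets satisfied via
    -- the sl_log_1 index, forcing the (possibly zeroed) index pages
    -- to actually be read.
    SET enable_seqscan = off;
    SELECT log_origin, log_xid, log_actionseq
      FROM "_mycluster".sl_log_1
     ORDER BY log_origin, log_xid, log_actionseq;
    RESET enable_seqscan;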

There are two possibilities (the first is illustrated just below the list):
1. The PostgreSQL instance may notice that those pages are bad,
returning an error message, and halting the SYNC.

2. The PostgreSQL instance may NOT notice that those pages are bad,
and, as a result, fail to apply some updates, thereby corrupting
the subscriber.
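
In case 1), the slon's SYNC fails and keeps failing until somebody
intervenes.  Poking at the table through the damaged index (as in the
sketch above) would produce something like this; the block number is
invented, but the message text is what PostgreSQL emits when a btree
page fails its sanity checks:

    ERROR:  index "sl_log_1_idx1" contains unexpected zero page at block 1234
    HINT:  Please REINDEX it.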

I think there's a pretty high probability of 1) happening rather than
2), but the risk of corrupting subscribers is roughly proportional to
the probability of 2).
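
The nasty thing about 2) is that nothing fails loudly; the only way I
know of to catch it is to compare data between the provider and the
subscriber.  A crude sketch (table and column names are invented, and
you'd want to compare at a quiescent moment, after the subscriber has
caught up):

    -- Run the same aggregates on the origin and on the subscriber,
    -- then compare the answers.  "accounts"/"balance" are invented
    -- names; any replicated table would do.
    SELECT count(*) AS rows, sum(balance) AS total FROM accounts;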

My "gut feel" is that the probability of 2) is pretty small, but I
don't have anything to point to as a proof of that...
--
output = reverse("gro.mca" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/
"One of the main causes of the fall of the Roman Empire was that,
lacking zero, they had no way to indicate successful termination of
their C programs." -- Robert Firth
