Re: High Availability with Postgres

From: John R Pierce <pierce(at)hogranch(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: High Availability with Postgres
Date: 2010-06-21 19:39:19
Message-ID: 4C1FBFE7.5030401@hogranch.com

On 06/21/10 12:23 PM, Dimitri Fontaine wrote:
> John R Pierce<pierce(at)hogranch(dot)com> writes:
>
>>>> Two DB servers will be using a common external storage (with raid).
>>>>
>> This is also one of the only postgres HA configurations that won't lose
>> /any/ committed transactions on a failure. Most all PITR/WAL
>> replication/Slony/etc configs, the standby storage runs several seconds
>> behind realtime.
>>
> I'm not clear on what error case it protects against, though. Either the
> data is ok and a single PostgreSQL system will restart fine, or the data
> isn't and you're hosed the same with or without the second system.
>
> What's left is hardware failure that didn't compromise the data. I
> didn't see much hardware failure yet, granted, but I'm yet to see a
> motherboard, some RAM or a RAID controller failing in a way that leaves
> behind data you can trust.
>

In most of the HA clusters I've seen, the RAID controllers are in the
SAN, not in the hosts, and they have their own failover with a shared
write cache, plus extensive use of ECC so that things like double-bit
memory errors are detected and treated as a failure. The sorts of
high-end SANs used in these kinds of systems achieve five-nines (99.999%)
reliability through extensive redundancy: dual-ported disks, fully
redundant everything, mirrored caches, and so on.

Ditto, the servers used in these sorts of clusters have ECC memory, so a
memory failure should be detected rather than passed on blindly in the
form of corrupted data. Server-grade CPUs, especially the RISC ones,
have extensive ECC internally on their caches, data buses, and so forth,
so any failure there is detected rather than allowed to corrupt data.
The remaining failure modes include things like failing fans (which will
be detected, resulting in a server shutdown if too many fail) and power
supply failure (the PSUs are redundant, but I've seen the power-combining
circuitry fail). Any of these sorts of failures will result in a
failover without corrupting the data.

And of course there are intentional, planned failovers for OS
maintenance: you patch the standby system, fail over to it and verify
it's good, then patch the other system.
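FWIW, here's a rough Python sketch of that patch-and-failover sequence,
just to make the order of operations concrete. The patch_node() and
fail_over_to() helpers are hypothetical stand-ins for whatever your
cluster manager provides (Pacemaker/crm, vendor tools, etc.), and the
"db-vip" / "db-node-*" names are made up; the only real PostgreSQL piece
is the health check, a plain SELECT 1 via psycopg2 against the cluster's
service address:

# Rough sketch of a planned rolling-maintenance failover.
# Assumptions: a two-node shared-storage cluster reachable through a single
# virtual IP ("db-vip"), and cluster-manager hooks patch_node()/fail_over_to()
# that you would implement for your own environment.

import time
import psycopg2


def db_is_healthy(host="db-vip", dbname="postgres", user="postgres"):
    """Basic sanity check: can we connect and run a trivial query?"""
    try:
        conn = psycopg2.connect(host=host, dbname=dbname, user=user,
                                connect_timeout=5)
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
        conn.close()
        return True
    except psycopg2.Error:
        return False


def patch_node(node):
    """Placeholder: apply OS patches to a node while it is passive."""
    raise NotImplementedError("site-specific: package updates, reboot, etc.")


def fail_over_to(node):
    """Placeholder: tell the cluster manager to move the DB service."""
    raise NotImplementedError("site-specific: e.g. a crm/pcs resource move")


def rolling_maintenance(active, standby):
    patch_node(standby)              # 1. patch the idle node first
    fail_over_to(standby)            # 2. move the database service over
    time.sleep(30)                   # give the service time to start
    if not db_is_healthy():          # 3. verify the new active node is good
        fail_over_to(active)         # back out if it isn't
        raise RuntimeError("failover verification failed, rolled back")
    patch_node(active)               # 4. now patch the former active node


if __name__ == "__main__":
    rolling_maintenance(active="db-node-1", standby="db-node-2")

The point is simply that the verification step sits between the failover
and patching the former active node, so you always have one known-good,
unpatched node to fall back to.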

We had a large HA system at an overseas site fail over once due to
flooding in the primary computer room, caused by a sprinkler system
failure upstairs. The SAN was mirrored to a SAN in the second data
center (fiber interconnected), and the backup server was also in that
second data center across campus, so it all failed over gracefully.
That particular system was large Sun hardware with big EMC storage,
running Oracle rather than Postgres. We've also had several big UPS
failures at various sites over a 15-year period, and HVAC failures as
well.
