From: | Ragnar Kjørstad <postgres(at)ragnark(dot)vestdata(dot)no> |
---|---|
To: | "Robert M(dot) Meyer" <rmeyer(at)installs(dot)com> |
Cc: | Eric Comeau <eric(dot)comeau(at)signiant(dot)com>, pgsql-admin(at)postgresql(dot)org |
Subject: | Re: High Availability PostgreSQL HOWTO article |
Date: | 2002-08-07 03:35:27 |
Message-ID: | 20020807053527.Q20768@vestdata.no |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
On Tue, Aug 06, 2002 at 03:42:19PM -0400, Robert M. Meyer wrote:
> I'm not a big fan of the shared storage solution. The added
> complication of identifying who has the RAID in a failover situation can
> cause unnecessary downtime. Since disk systems are typically the
> highest risk of failure items (IMHO), the single point of failure is
> just too much for high availability. Anything that happens to the
> filestore such as someone whacking files will cause a catastrophic
> failure. We're using LifeKeeper on a Compaq 3500 server in an attempt
> at high availability. We have had situations where we've lost
> connectivity due to a switch reboot and had the secondary system try to
> take over the database RAID. Since the actual primary is not dead, the
> database is still running and bad things happen during the transfer.
> Postgres isn't happy when it's data directory disappears out from under
> it.
The disk-system hardware shouldn't be a single point of failure, as
everything can be made with redundancy. The trashing that you describe
is supposed to be avoided by a fencing-mechanism in the
cluster-software. The most common fencing technique is something called
stonith (shoot the other node i the head) where a node must kill the
other one (typicly cut of power) before taking shared resources - to
avoid problems like this.
Lifekeeper uses a different technique; scsi reservations. Unfortenately
it doesn't work.
Also, the cluster-documentation (at least for heartbeat, but I think
it's the case for Lifekeeper as well) spesificly tells you that you need
multiple communication-channels between the nodes to avoid problems of
lost communication. One serial link + a dedicated ethernet crossover is
the most common way.
That said, I fully agree that clustering, and shared storage in
particular, adds complication. A buggy or incorrectly configured
cluster is likely to give more downtime than a single server.
I strongly feel that online documentation that recommends ha-software
(like the one starting this discussion) should emphasise on this, and
recommend that users that are not familiar with the software should get
professional services rather than try to follow a recipi.
What shared storage clusters will not protect you against is things like
filesystem or database corruption. So there is no substitute for
backups.
> I would recommend a system that syncs the entire database periodically
> (pg_dumpp/pg_restore?) and then creates a queued transaction path that
> gets applied at distinct intervals with the capability to post the
> transactions at the point of discovery of the failure of the primary
> system. This way, a transaction that damages the database (not
> necessarily corruption but bad data) can be caught before it makes it
> into the replica. I haven't found anything in Postgres that allows me
> to keep an SQL transaction log that I can pass over.
Check out rserv. It's somewhat simular to what you want.
> I've been thinking about the debug flags that allow me to log all of the
> queries that pass through the system as a possible path to this. It
> wouldn't be hard to write a perl/sed/awk/whatever script to strip out
> the 'DEBUG:' bits and removes the 'SELECT' queries, leaving only the
> 'UPDATE' and 'INSERT' queries. This could be trapped by a process that
> writes out the stream to a process on the replica server that would just
> store them in a file. In the event of a primary system failure, a
> process would detect the loss of the primary server, force the current
> transaction log into the database, turn off the receiver daemon, then
> take whatever actions are necessary to become the primary.
Once the primary is dead you will no longer be able to get a copy of the
transaction-log from it. That means that there will be a window, however
small, of lost data. If your application can live happily with that,
then replicated storage may be your solution. If it can't live with
that, shared storage is the only answer. (short of replication with
2-phase commits inside the database-server)
> Of course, I haven't worked out all of the details of this :-). Does
> this sound like a worthwhile endeavor? What holes can anyone punch in
> this? I know that the pg_restore time is a period of time that failover
> would be vulnerable.
I think it should work well if you can live with lost data in case of
failure. There are some details about how you do the syncs you should
look at to get them as fast as possible, and to make the windows of
potential lost data as short as possible. Look into things like
snapshots, rserv and maybe duplicating the db on the secondary before
updating it - so you have an older data-set to go back to if the primary
dies while you're syncing.
This is assuming your applications do far more reads than writes -
because if it's write intensive it get's much harder.
Also, be aware of all the complications. Even without shared storage the
clustering adds a dozen new ways your database can be corrupted! Think
for instance if you loose communications, and both servers think they're
primary. Then you have transactions going on on both servers - how are
you going to merge them afterwards? Basicly you can't, so you have to
avoid this situation from happening in the first place. (read: fencing)
If you truely want high availability, I would recommend you to get
professional services from either SuSE or IBM - they both have great
engineers on the heartbeat development team. We also offer supported
high-availability solutions, but only for storage - so for something as
special as databases you're probably better off with someone else.
--
Ragnar Kjørstad
Big Storage
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2002-08-07 04:07:37 | Re: initdb "Fails to initialize lc_time" (using 7.3.1) |
Previous Message | Tom Lane | 2002-08-07 03:18:06 | Re: Running POSTMASTER problems |