From: "Czichy, Thoralf (NSN - FI/Helsinki)" <thoralf(dot)czichy(at)nsn(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: "ext Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Stark" <stark(at)enterprisedb(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "ext Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "ext Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Kolb, Harald (NSN - DE/Munich)" <harald(dot)kolb(at)nsn(dot)com>
Subject: Re: postmaster recovery and automatic restart suppression
Date: 2009-06-16 15:22:59
Message-ID: 2CD972C575FD624E814506C4916F3E1F50E59C@FIESEXC035.nsn-intra.net
Lists: pgsql-hackers

hi,

I am working together with Harald on this issue. Below are some thoughts
on why we think it should be possible to disable the postmaster-internal
recovery attempt and instead have faults in the processes started by the
postmaster escalated to a postmaster exit.

[Our typical "embedded" situation]

* The database is small, 0.1 to 1 GB (e.g. we consider it the safest
  strategy to copy the whole database from the active to the standby
  before reconnecting the standby after a switchover or failover).

* Only a few clients (10-100).

* There is no shared storage between the two instances (this means no
  concurrent access to shared resources and no isolation problems for
  shared resources).

* Switchover is fast, less than a few seconds.

* Disk I/O is slow (no RAID, possibly slow flash-based storage).

* The same nodes that run the database also run lots of other
  functionality (some dependent on the DB, most not).

[Keep recovery decision and recovery action in cluster-HA-middleware]

Actually the problem we're trying to solve is to keep the decision about
the best recovery strategy outside of the DB. In our use case this logic
is expressed in the cluster-HA-middleware, and recovery actions are
initiated by this middleware rather than by each individual piece of
software started by it; software is generally expected to "fail fast and
safe" in case of errors. As long as you trust the hardware and the OS
kernel, a process exit is usually such a fail-fast-and-safe operation.
It's "safe" because the process exit causes the kernel to release the
resources the process holds. It's also fast, though "fast" is a bit more
debatable, as a simple signal from the postmaster to the cluster
middleware would probably be faster. However, lacking such a signal, a
SIGCHLD is the next best thing.
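
To make the fail-fast-and-safe detection concrete, here is a minimal
sketch, not our actual middleware, of a supervisor that starts the
postmaster and treats its exit (observed via waitpid(), i.e. the SIGCHLD
path) as the failure notification. The postgres binary and data-directory
paths are made-up examples.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    pid_t   pm = fork();

    if (pm < 0)
    {
        perror("fork");
        return 1;
    }
    if (pm == 0)
    {
        /* child: run the postmaster in the foreground, under supervision */
        execl("/usr/local/pgsql/bin/postgres", "postgres",
              "-D", "/var/lib/pgsql/data", (char *) NULL);
        _exit(127);             /* exec failed */
    }

    /* parent: block until the postmaster exits; waitpid() returning here
     * is effectively the SIGCHLD notification mentioned above */
    int     status;

    while (waitpid(pm, &status, 0) < 0)
        ;                       /* retry on EINTR */

    /* the kernel has already released the dead process's resources
     * ("safe"); the middleware now applies whatever recovery policy is
     * configured (restart, switchover, escalate) */
    fprintf(stderr, "postmaster exited, invoking recovery policy\n");
    return 1;
}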

The middleware can make decisions such as (all of this is configurable
and postmaster-health is _just_one_input_ of many to reach a decision on
the correct behavior):

Policy 1: By default try to restart the active instance N times, after
          that do a switchover.
Policy 2: If the active Postgres fails and the standby is available and
          up-to-date, do an immediate switchover. If the standby is not
          available, restart.
Policy 3: If the active Postgres fails, escalate the problem to
          node-level, isolate the active node and do the switchover to
          the standby.
Policy 4: In single-node systems, restart db instance N times. If it
          fails more often than N times in X seconds, stop it and give an
          indication to the operator (SNMP-trap to management system,
          text message, ...) that something is seriously wrong and manual
          intervention is needed.

In the current setup we want to go for Policy 2. In earlier unrelated
products (not using PostgreSQL) we actually had policies 1, 3 and 4.
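
For illustration only, Policy 2 boils down to a decision function like
the following sketch. The two standby_* helpers are hypothetical
placeholders for state the clusterware already tracks (heartbeats,
replication status); they are not PostgreSQL or middleware APIs.

#include <stdbool.h>
#include <stdio.h>

typedef enum { ACTION_RESTART, ACTION_SWITCHOVER } recovery_action;

/* placeholder probes; in reality fed by heartbeats and standby state
 * that the clusterware already tracks */
static bool standby_is_available(void)  { return true; }
static bool standby_is_up_to_date(void) { return true; }

/* Policy 2: immediate switchover if a usable standby exists, otherwise
 * restart the active instance locally */
static recovery_action
policy2_on_active_failure(void)
{
    if (standby_is_available() && standby_is_up_to_date())
        return ACTION_SWITCHOVER;
    return ACTION_RESTART;
}

int
main(void)
{
    printf("decision: %s\n",
           policy2_on_active_failure() == ACTION_SWITCHOVER ?
           "switchover" : "restart");
    return 0;
}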

Another typical situation is that recovery behavior is different during
upgrades compared to the behavior during normal operation. E.g. when the
(new) database instance fails during an automatic schema conversion
during upgrade, we would want to automatically fall back to the previous
version.

[STONITH is not always the best strategy if failures can be declared a
user-space software problem only; limit STONITH to HW/OS failures]

Isolating the failing Postgres instance does not require a STONITH,
mainly because there's also other software running on the same node that
we'd not want to automatically switch over (e.g. because it takes longer
to do, or the functionality is more critical or less critical). Also, we
generally trust the HW, OS kernel and cluster middleware to behave
correctly. These functions also follow the principle of
fail-fast-and-safe. This trust might be an assumption that not everybody
agrees with, though. So, if the failure originated from HW/OS/clusterware
it clearly is a STONITH situation, but if it's a user-space problem, the
default assumption is that isolation can be implemented on the OS level,
and that's a guarantee that the clusterware gives (using a separate
quorum mechanism to avoid split-brain situations).

[Example of user-space software failures]

So, what kind of failures would cause a user-space switchover rather
than node-level isolation? This gets a bit philosophical. If you assume
that many software failures are caused by concurrency issues, switching
over to the standby is actually a good strategy, as it's unlikely that
the same concurrency issue happens again on the standby. Another source
of software failures is entering exceptional situations, such as the
disk getting full, overload on the node (caused by some other process),
a backup being taken, an upgrade conversion, etc. So here the idea is
that failover to a standby instance helps as long as there's some hope
that on the standby side the situation is different. If we just had an
internal Postgres restart in such situations, we'd have flapping db
connectivity - without the operator even being aware of it (awareness of
problem situations is also something that the cluster HA middleware
takes care of).

[Possible implementation options]

I see only two solutions that allow an external cluster-HA-middleware to
make recovery decisions:

(1) the postmaster process exits if it detects any unpredicted failure, or

(2) the postmaster provides an interface to notify about software
    failures (i.e. the case where it goes into postmaster
    re-initialization).

In case (2) it would be the cluster-HA-middleware that isolates the
postmaster process, e.g. by SIGKILL-ing all related processes and
forcefully releasing all shared resources that it uses. However, I favor
case (1), as long as we keep the logic that runs within the postmaster
when it detects a backend process failure as simple as possible -
meaning force-stop all postgres processes (SIGKILL), wait for the
SIGCHLDs from them and exit (this should only take a few milliseconds).
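
For the sake of argument, that case (1) path could look roughly like the
sketch below. This is not actual PostgreSQL code; child_pids and
n_children stand in for the postmaster's real child-process bookkeeping.

#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Fail fast and safe instead of re-initializing: force-stop every
 * remaining child, reap them, and exit with a nonzero status so that
 * the cluster-HA-middleware gets its own SIGCHLD and can decide what
 * to do next (restart, switchover, escalate).
 */
static void
fail_fast_and_safe(pid_t *child_pids, int n_children)
{
    int     i;

    for (i = 0; i < n_children; i++)
        kill(child_pids[i], SIGKILL);       /* force-stop all children */

    for (i = 0; i < n_children; i++)
        waitpid(child_pids[i], NULL, 0);    /* wait for the SIGCHLDs */

    exit(1);                                /* escalate to postmaster exit */
}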

[Question]

So the question remains: is this behavior, and the most likely addition
of a postgresql.conf "automatic_restart_after_crash = on" setting,
something that completely goes against the Postgres philosophy, or is it
something that, once implemented, would be acceptable to have in the
main Postgres code base?

Thoralf
