Error stopping postgresql service on a standby server.

From: Dipti Bharvirkar <dbharvirkar(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Error stopping postgresql service on a standby server.
Date: 2012-08-31 07:47:38
Message-ID: CAL097-FMKWhBdMPWVgTmZBJYOVR_iUWx-sx_84oJGsF3csBe5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

In our project, we use Postgres 9.1.3 version and asynchronous streaming
replication.
In recent times, on couple of our setups, we saw issues stopping Postgres
service on the standby server after streaming replication was setup.

The command "service postgresql stop" returned with a failure message. We
use "pg_ctl stop -D '$PGDATA' -s -m fast" in the Postgres service script to
stop the server.
To see if there were some active client connections that were causing a
failure in stopping Postgres service, I ran the query "SELECT * FROM
pg_stat_activity;".
It failed with the following error: psql: FATAL: the database system is
shutting down

"ps -ef | grep postgres" returned the following:
postgres 14033 1 0 Aug28 ? 00:00:01
/usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 14044 14033 0 Aug28 ? 00:00:00 postgres: logger process
postgres 14046 14033 0 Aug28 ? 00:00:00 postgres: writer process
postgres 14047 14033 0 Aug28 ? 00:00:00 postgres: stats collector
process
postgres 14912 14033 0 Aug28 ? 00:00:00 postgres: wal receiver
process
root 31519 3003 0 06:18 pts/2 00:00:00 grep postgres

"netstat -anp | grep 5432" returns the following:
tcp 0 0 0.0.0.0:5432 0.0.0.0:*
LISTEN 14033/postmaster
tcp 0 0 127.0.0.1:5432 127.0.0.1:60597
TIME_WAIT -
tcp 0 0 127.0.0.1:5432 127.0.0.1:60589
TIME_WAIT -
tcp 67288 0 1.1.1.1:61500 <http://47.11.49.176:61500> 2
.2.2.2:5432 <http://47.11.49.190:5432> ESTABLISHED
14912/postgres: wal

I had a few queries based on some of the observations -

1. On one of the setups where similar issue was observed, we stopped
Postgres service on the master server. As a result of this, the sender
process on the master server and consequently the receiver process on
standby stopped. After this, Postgres service could successfully be stopped
on the standby server. This fact coupled with the output of the two
commands mentioned above makes me believe that it is the "wal receiver"
process that is not getting terminated because of which the Postgres
service on standby server does not stop. Is this assumption right?
2. If yes, what could be the possible cause for the receiver process to
not terminate? Shouldn't it stop gracefully when a shutdown command is
received? When the issue occurred, we had minimal activity on the master
server. There were no long running transactions being committed to the
master and streamed to the standby when the issue occurred. Even if there
were, could it cause the receiver process to not terminate?
3. How can we avoid running into this issue? Could we be missing some
step that is essential for a graceful shutdown of the service on a standby?
4. On one setup where the issue was seen, since "-m fast" option with
"pg_ctl stop" did not help in stopping the service, I used the "-m
immediate" option. The service stopped (I understand that this option
aborts the processes without a clean shutdown and so is not a safe option).
The service would not start back up. We saw the "invalid record length"
error during the startup (I guess this was expected since it wasn't a clean
shutdown). A pg_resetxlog helped recover from this issue. However, that
seems risky too since there is a chance of data inconsistency. What is the
best way to recover from this error if it occurs again?

Thanks,
Dipti

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Achilleas Mantzios 2012-08-31 08:48:48 Re: psql & unix env variables
Previous Message Pavan Deolasee 2012-08-31 05:39:42 Re: [GENERAL] UPDATE RULE to be invoked when UPDATE .. WHERE fails the WHERE predicate ?‏