I have observed the following situation a few times now, with 8.4.5.
Multiple psql clients are connected to the server;
some of them are running a transaction and
some of them are idle.
When one of the backends is killed or crashes
(for example, using kill -9 <backend-pid>),
the connection reset attempt from the active
clients (that is, those whose transaction was interrupted by the crash) fails, because they make
the attempt immediately, while the server is still in its startup phase.
You can see this in the following sessions:
-----------------------
ACTIVE CLIENT
-----------------------
[amul@localhost ~]$ psql -p 5432 postgres
psql (8.4.5)
Type "help" for help.

postgres=# create table emp( id int,name varchar(20));
CREATE TABLE
postgres=# insert into emp values(generate_series(1,999999999),'XYZ');
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!
-----------------------
IDLE CLIENT
-----------------------
[amul@localhost ~]$ psql -p 5432 postgres
psql (8.4.5)
Type "help" for help.
postgres=# select pg_backend_pid();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=#
I looked into this and found the following:
1. When a backend crashes, the server goes into
recovery mode, and it takes a little time before it returns to the normal
state and accepts connections again.
2. However, the busy client (the one that was running a
transaction before the crash) immediately
tries to reconnect while the server is still in its startup phase, so it gets a negative reply and fails to reconnect.
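To make the timing in points 1 and 2 concrete, here is a small, hypothetical C model. It is not PostgreSQL code: `CrashTimeline`, `accepts_connections` and `single_reset_succeeds` are invented names, and the recovery duration is an assumed parameter. It only encodes the claim above: a single reset attempt made right after the crash falls inside the recovery window and is refused, while one made later (as with the idle client's next command) is accepted.

```c
#include <stdbool.h>

/* Hypothetical timing model (not PostgreSQL code): after a backend crash
 * the postmaster terminates all backends, runs crash recovery, and
 * rejects new connections until recovery_ms have passed. */
typedef struct
{
    long crash_at_ms;  /* time at which the backend was killed */
    long recovery_ms;  /* assumed duration of crash recovery */
} CrashTimeline;

/* True if a connection attempt made at time at_ms would be accepted. */
bool accepts_connections(const CrashTimeline *t, long at_ms)
{
    return at_ms >= t->crash_at_ms + t->recovery_ms;
}

/* A client that notices the lost connection notice_delay_ms after the
 * crash and then attempts exactly one reset, without waiting. */
bool single_reset_succeeds(const CrashTimeline *t, long notice_delay_ms)
{
    return accepts_connections(t, t->crash_at_ms + notice_delay_ms);
}
```

For example, with a timeline of `{1000, 500}`, a busy client that resets 1 ms after the crash is refused, while an idle client whose next command arrives 30 s later reconnects.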
So I think that, before sending the reconnect request,
the client needs to wait for the server
to reach a state in which it can accept connections, with some timeout on that wait.
I am not sure whether this is the correct way to modify the code, or whether it has any other impact.
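One way to express "wait, with a timeout" is a bounded retry loop rather than a single fixed sleep. The sketch below is only an assumption about how such a loop could look, not the attached patch: `reset_with_timeout`, `FakeServer` and the hook types are hypothetical names. In psql, `try_reset` would presumably wrap `PQreset()` followed by a `PQstatus(conn) == CONNECTION_OK` check, and `sleep_ms` could use `pg_usleep()`; here both are injected so the loop can be exercised against a simulated clock instead of a real server.

```c
#include <stdbool.h>

/* Hypothetical hooks: try_reset stands in for PQreset() plus a status
 * check; sleep_ms stands in for a real sleep such as pg_usleep(). */
typedef bool (*TryResetFn)(void *ctx);
typedef void (*SleepFn)(void *ctx, long ms);

/* Retry the reset until it succeeds or timeout_ms have been spent
 * waiting, pausing interval_ms between attempts. True on success. */
bool reset_with_timeout(TryResetFn try_reset, SleepFn sleep_ms, void *ctx,
                        long timeout_ms, long interval_ms)
{
    long waited = 0;

    for (;;)
    {
        if (try_reset(ctx))
            return true;
        if (waited >= timeout_ms)
            return false;      /* give up: server never became ready */
        sleep_ms(ctx, interval_ms);
        waited += interval_ms;
    }
}

/* Simulated server used to exercise the loop without real sleeping. */
typedef struct
{
    long now_ms;       /* simulated clock */
    long ready_at_ms;  /* when crash recovery finishes */
    int  attempts;     /* number of reset attempts made */
} FakeServer;

bool fake_try_reset(void *ctx)
{
    FakeServer *fs = ctx;

    fs->attempts++;
    return fs->now_ms >= fs->ready_at_ms;
}

void fake_sleep(void *ctx, long ms)
{
    FakeServer *fs = ctx;

    fs->now_ms += ms;  /* advance the simulated clock instead of sleeping */
}
```

Compared with one fixed sleep, polling returns as soon as the server is ready, and the timeout still prevents a server that never recovers from hanging the client indefinitely.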
I tried making the client wait before sending the reconnect request to the server.
To do that, I added some sleep time for the client in
src/bin/psql/common.c (so the change affects only psql clients).
Please check the attached patch for the
modification.