Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From: Daniel Hahler <postgresql(at)thequod(dot)de>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
Date: 2014-03-25 15:17:52
Message-ID: 53319E20.9030006@thequod.de
Lists: pgsql-bugs

On 25.03.2014 15:36, Alvaro Herrera wrote:
> Tom Lane wrote:
>> postgresql(at)thequod(dot)de writes:
>>> PostgreSQL just failed to start up after a reboot (which was forced via
>>> remote Ctrl-Alt-Delete on the PostgreSQL container's host):
>>
>>> 2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection
>>> reset by peer
>>> 2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
>>> 2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108,
>>> but it contains 0
>>> 2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109,
>>> but it contains 0
>>> 2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110,
>>> but it contains 0
>>
>> That's odd (and as you say, unexpected) but this log extract doesn't give
>> much clue as to how we got into this state. What was going on before
>> this? In particular, it's hard to call this "failure to start up" when
>> you evidently had a hundred or so postmaster child processes already.
>> Could there have been some unexpected surge in the number of connection
>> attempts just after the database came up? Also, this extract doesn't look
>> like anything that would've caused the postmaster to decide to shut down
>> again, so what happened after that? Or in short, I want to see the rest
>> of the log not just this part.

That was the whole log.

The previous (rotated) log file contains only:
2014-03-22 03:51:37 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:52:25 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:59:31 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 04:00:18 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 06:03:06 CET LOG: could not receive data from client: Connection reset by peer

Should I increase the logging verbosity, in case this happens again?
If so, to what level? (I have not configured logging yet, so it still uses the defaults from your Debian package.)
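
My untested guess would be something along these lines in postgresql.conf, mainly to see whether there is a surge of connection attempts. All three are standard settings, but the values are only my assumption, not a recommendation:

```
# Assumed values, not a tested or recommended configuration.
log_connections = on          # log each connection attempt
log_disconnections = on       # log session end, including duration
log_line_prefix = '%t [%p] '  # timestamp and process ID on every line
```

With the process ID in the prefix, lines like the ProcSignal messages could be matched to the backend that wrote them.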

> Here's my guess --- this is a virtualized system that somehow dumped
> some state to disk to hibernate while the host was being rebooted; and
> then, when the host was up again, it tried to resurrect the virtual
> machine and found things to be all inconsistent.

Yes, the container was frozen during reboot:

From the host:
Mar 25 11:54:48 HN kernel: [ 76.237452] CT: 144: started
Mar 25 11:55:03 HN kernel: [ 91.201145] CT: 144: restored

By default, OpenVZ uses "suspend" to stop containers when the host reboots.
I will change this to "stop" for the PostgreSQL container, but this still seems like something PostgreSQL should handle better.

FWIW, I have just suspended and started the container manually, and PostgreSQL kept running (upgraded to 9.3.4 in the meantime).

Maybe it is a bug in how OpenVZ restores some resources after the host reboots?

Please also note that the PostgreSQL error occurred about half an hour after the host reboot and the container's resume.

Thanks,
Daniel.

--
http://daniel.hahler.de/
