Quick Links

Re: bgworker crashed or not?

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Antonin Houska <antonin(dot)houska(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Subject:	Re: bgworker crashed or not?
Date:	2014-05-06 17:05:22
Message-ID:	CA+TgmoZyW_xVDyew0pNVTe43VbNhPMp2kGi+kcw+f7fM-GXy7w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sun, May 4, 2014 at 9:57 AM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> On 30/04/14 23:35, Robert Haas wrote:
>> On Mon, Apr 28, 2014 at 4:19 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com>
>> wrote:
>>> On 28/04/14 16:27, Robert Haas wrote:
>>>> It seems we have consensus on what to do about this, but what we
>>>> haven't got is a patch.
>>>
>>>
>>> If you mean the consensus that exit status 0 should mean permanent stop
>>> then
>>> I think the patch can be as simple as attached.
>>
>> Hmm. Well, at the very least, you need to update the comment.
>
> Obviously, also docs, the above was just demonstration that the patch is
> simple (and also all I could manage to write and test on the way from
> airport to hotel).
>
> The attached patch is more proper.

Hmm, I see some problems here. The current comment for
bgworker_quickdie() says that we do exit(0) there rather than exit(2)
to avoid having the postmaster delay restart based on
bgw_restart_time, but with the proposed change in the meaning of
exit(2), that would have the effect of unregistering the worker, which
I guess is why you've changed it to call exit(2) instead, but then the
bgw_restart_delay applies, which I think we do not want. To make
matters more confusing, the existing comment is actually only correct
for non-shmem-connected workers - shmem-connected workers will treat
an exit status of anything other than 0 or 1 in exactly the same
fashion as a failure to disengage the deadman switch.

Which brings up another point: the behavior of non-shmem-connected
workers is totally bizarre. An exit status other than 0 or 1 is not
treated as a crash requiring a restart, but failure to disengage the
deadman switch is still treated as a crash requiring a restart. Why?
If the workers are not shmem-connected, then no crash requires a
system-wide restart. Of course, there's the tiny problem that we
aren't actually unmapping shared memory from supposedly non-shmem
connected workers, which is a different bug, but ignoring that for the
moment there's no reason for this logic to be like this.

What I'm inclined to do is change the logic so that:

(1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so
that anything which is still registered gets restarted immediately.

(2) If a shmem-connected backend fails to release the deadman switch
or exits with an exit code other than 0 or 1, we crash-and-restart. A
non-shmem-connected backend never causes a crash-and-restart.

(3) When a background worker exits without triggering a
crash-and-restart, an exit code of precisely 0 causes the worker to be
unregistered; any other exit code has no special effect, so
bgw_restart_time controls.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: bgworker crashed or not? at 2014-05-04 13:57:33 from Petr Jelinek

Responses

Re: bgworker crashed or not? at 2014-05-07 00:25:40 from Petr Jelinek

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Josh Berkus	2014-05-06 17:08:01	Re: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Previous Message	Andres Freund	2014-05-06 16:55:05	Re: proposal: Set effective_cache_size to greater of .conf value, shared_buffers