Quick Links

Re: src/test/subscription/t/002_types.pl hanging on particular environment

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: src/test/subscription/t/002_types.pl hanging on particular environment
Date:	2017-09-18 16:42:24
Message-ID:	30631.1505752944@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The subscriber log includes
>> 2017-09-18 08:43:08.240 UTC [15672] WARNING: out of background worker slots
>> Maybe that's harmless, but I'm suspicious that it's a smoking gun.

> I have noticed while fixing the issue you are talking that this path
> is also susceptible to such problems. In
> WaitForReplicationWorkerAttach(), it relies on
> GetBackgroundWorkerPid() to know the status of the worker which won't
> give the correct status in case of fork failure. The patch I have
> posted has a fix for the issue,

Your GetBackgroundWorkerPid fix looks good as far as it goes, but
I feel that WaitForReplicationWorkerAttach's problems go way deeper
than that --- in fact, that function should not exist at all IMO.

The way that launcher.c is set up, it's relying on either the calling
process or the called process to free the logicalrep slot when done.
The called process might never exist at all, so the second half of
that is obviously unreliable; but WaitForReplicationWorkerAttach
can hardly be relied on either, seeing it's got a big fat exit-on-
SIGINT in it. I thought about putting a PG_TRY, or more likely
PG_ENSURE_ERROR_CLEANUP, into it, but that doesn't fix the basic
problem: you can't assume that the worker isn't ever going to start,
just because you got a signal that means you shouldn't wait anymore.

Now, this rickety business might be fine if it were only about the
states of the caller and called processes, but we've got long-lived
shared data involved (ie the worker slots); failing to clean it up
is not an acceptable outcome.

So, frankly, I think we would be best off losing the "logical rep
worker slot" business altogether, and making do with just bgworker
slots. The alternative is getting the postmaster involved in cleaning
up lrep slots as well as bgworker slots, and I'm going to resist
any such idea strongly. That way madness lies, or at least an
unreliable postmaster --- if we need lrep slots, what other forty-seven
data structures are we going to need the postmaster to clean up
in the future?

I haven't studied the code enough to know why it grew lrep worker
slots in the first place. Maybe someone could explain?

regards, tom lane

In response to

Re: src/test/subscription/t/002_types.pl hanging on particular environment at 2017-09-18 15:14:46 from Amit Kapila

Responses

Re: src/test/subscription/t/002_types.pl hanging on particular environment at 2017-09-19 10:04:48 from Petr Jelinek

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2017-09-18 16:50:40	Re: [PATCH] Add citext_pattern_ops to citext contrib module
Previous Message	Tom Lane	2017-09-18 16:30:44	Re: parallel.c oblivion of worker-startup failures