From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Strange failure on mamba |
Date: | 2022-11-17 22:35:10 |
Message-ID: | CA+hUKGJwQ-J+hRZ+zG=s7FZWKJ-suHi-aQREbRpWUrR-JcOm8g@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > I wonder why the walreceiver didn't start in
> > 008_min_recovery_point_node_3.log here:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38
>
> mamba has been showing intermittent failures in various replication
> tests since day one. My guess is that it's slow enough to be
> particularly subject to the signal-handler race conditions that we
> know exist in walreceivers and elsewhere. (Now, it wasn't any faster
> in its previous incarnation as a macOS critter. But maybe modern
> NetBSD has different scheduler behavior than ancient macOS and that
> contributes somehow. Or maybe there's some other NetBSD weirdness
> in here.)
>
> I've tried to reproduce manually, without much success :-(
>
> Like many of its other failures, there's a suggestive postmaster
> log entry at the very end:
>
> 2022-11-16 19:45:53.851 EST [2036:4] LOG: received immediate shutdown request
> 2022-11-16 19:45:58.873 EST [2036:5] LOG: issuing SIGKILL to recalcitrant children
> 2022-11-16 19:45:58.881 EST [2036:6] LOG: database system is shut down
>
> So some postmaster child is stuck somewhere where it's not responding
> to SIGQUIT. While it's not unreasonable to guess that that's a
> walreceiver, there's no hard evidence of it here. I've been wondering
> if it'd be worth patching the postmaster so that it's a bit more verbose
> about which children it had to SIGKILL. I've also wondered about
> changing the SIGKILL to SIGABRT in hopes of reaping a core file that
> could be investigated.
I wonder if it's a runtime variant of the other problem. We do
load_file("libpqwalreceiver", false) before unblocking signals but
maybe don't resolve the symbols until calling them, or something like
that...