From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Postmaster doesn't correctly handle crashes in PM_STARTUP state |
Date: | 2023-07-29 21:51:24 |
Message-ID: | 20230729215124.ra4rbwck5dlawvmo@awork3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
While testing something I made the checkpointer process intentionally crash as
soon as it started up. The odd thing I observed on macOS is that we start a
*new* checkpointer before shutting down:
2023-07-29 14:32:39.241 PDT [65031] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-07-29 14:32:39.244 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.244 PDT [65031] LOG: checkpointer process (PID 65032) was terminated by signal 11: Segmentation fault: 11
2023-07-29 14:32:39.244 PDT [65031] LOG: terminating any other active server processes
2023-07-29 14:32:39.244 PDT [65031] DEBUG: sending SIGQUIT to process 65034
2023-07-29 14:32:39.245 PDT [65031] DEBUG: sending SIGQUIT to process 65033
2023-07-29 14:32:39.245 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.245 PDT [65035] LOG: process 65035 taking over ProcSignal slot 126, but it's not empty
2023-07-29 14:32:39.245 PDT [65031] DEBUG: reaping dead processes
2023-07-29 14:32:39.245 PDT [65031] LOG: shutting down because restart_after_crash is off
Note that a new process (65035) is started after the crash has been
observed. I added logging to StartChildProcess(), and the process that's
started is another checkpointer.
I could not initially reproduce this on Linux.
After a fair bit of confusion, I figured out the reason: On macOS it takes a
bit longer for the startup process to finish, which means we're still in
PM_STARTUP state when we see that crash, instead of PM_RECOVERY or PM_RUN or
...
The problem is that unfortunately HandleChildCrash() doesn't change pmState
when in PM_STARTUP:
    /* We now transit into a state of waiting for children to die */
    if (pmState == PM_RECOVERY ||
        pmState == PM_HOT_STANDBY ||
        pmState == PM_RUN ||
        pmState == PM_STOP_BACKENDS ||
        pmState == PM_SHUTDOWN)
        pmState = PM_WAIT_BACKENDS;
Once I figured that out, I put a sleep(1) in StartupProcessMain(), and the
problem reproduces on Linux as well.
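To make the gap concrete, here is a minimal standalone model of the check
quoted above. It is only a sketch, not the postmaster code: the enum is a
simplified subset of the states named in this mail, and state_after_crash(),
state_after_crash_with_startup() and check_startup_gap() are hypothetical
helpers made up for illustration. The second helper assumes that PM_STARTUP
should be treated like the other states, which is one possible change rather
than a tested fix.

```c
#include <assert.h>

/* Simplified subset of the postmaster states named in this mail;
 * the real enum in postmaster.c has more members. */
typedef enum
{
    PM_STARTUP,
    PM_RECOVERY,
    PM_HOT_STANDBY,
    PM_RUN,
    PM_STOP_BACKENDS,
    PM_WAIT_BACKENDS,
    PM_SHUTDOWN,
    PM_SHUTDOWN_2
} PMState;

/* Hypothetical helper: the state after a child crash, per the quoted check. */
PMState
state_after_crash(PMState s)
{
    if (s == PM_RECOVERY ||
        s == PM_HOT_STANDBY ||
        s == PM_RUN ||
        s == PM_STOP_BACKENDS ||
        s == PM_SHUTDOWN)
        return PM_WAIT_BACKENDS;
    return s;                   /* PM_STARTUP falls through unchanged */
}

/* Hypothetical variant, ASSUMING PM_STARTUP should also move to
 * PM_WAIT_BACKENDS. This is an illustration, not a reviewed patch. */
PMState
state_after_crash_with_startup(PMState s)
{
    if (s == PM_STARTUP)
        return PM_WAIT_BACKENDS;
    return state_after_crash(s);
}

/* Usage example: exercises both helpers on the states discussed here. */
void
check_startup_gap(void)
{
    /* Current check: a crash seen in PM_STARTUP leaves the state alone. */
    assert(state_after_crash(PM_STARTUP) == PM_STARTUP);
    /* A crash seen in PM_RUN moves to waiting for children to die. */
    assert(state_after_crash(PM_RUN) == PM_WAIT_BACKENDS);
    /* With the assumed change, PM_STARTUP behaves like PM_RUN. */
    assert(state_after_crash_with_startup(PM_STARTUP) == PM_WAIT_BACKENDS);
}
```

In this model a crash seen in PM_STARTUP leaves the state untouched, which
matches the log above where the postmaster keeps behaving as if it were still
starting up.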
I haven't fully dug through the history, but this looks to be quite an old
problem. Arguably we might also be missing PM_SHUTDOWN_2, but I can't really
see a bad consequence of that.
Greetings,
Andres Freund