From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Race conditions with checkpointer and shutdown
Date: 2019-04-16 22:45:29
Message-ID: 21934.1555454729@sss.pgh.pa.us
Lists: pgsql-hackers
Michael Paquier <michael(at)paquier(dot)xyz> writes:
> The buildfarm has reported two similar failures when shutting down a
> node:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01
> In both cases, the instance cannot shut down because it times out,
> waiting for the shutdown checkpoint to finish but I suspect that this
> checkpoint actually never happens.
Hmm, I don't think that that is actually where the problem is. In
piculet's failure, the test script times out waiting for a "fast"
shutdown of the standby server, and what we see in the standby's log is
2019-03-23 22:44:12.181 UTC [9731] LOG: received fast shutdown request
2019-03-23 22:44:12.181 UTC [9731] LOG: aborting any active transactions
2019-03-23 22:44:12.181 UTC [9960] FATAL: terminating walreceiver process due to administrator command
2019-03-23 22:50:13.088 UTC [9731] LOG: received immediate shutdown request
where the last line indicates that the test script lost patience and
issued an immediate shutdown. However, in a successful run of the
test, the log looks like
2019-03-24 03:33:25.592 UTC [23816] LOG: received fast shutdown request
2019-03-24 03:33:25.592 UTC [23816] LOG: aborting any active transactions
2019-03-24 03:33:25.592 UTC [23895] FATAL: terminating walreceiver process due to administrator command
2019-03-24 03:33:25.595 UTC [23819] LOG: shutting down
2019-03-24 03:33:25.600 UTC [23816] LOG: database system is shut down
2019-03-24 03:33:25.696 UTC [23903] LOG: starting PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.2.0-12) 8.2.0, 64-bit
where the last line reflects restarting the server for the next test step.
So in the failure case we don't see the "shutting down" message, which
means we never got to ShutdownXLOG, so no checkpoint request was made.
Even if we had got to ShutdownXLOG, the process just executes the
operation directly; it's not sending a signal asking some other process
to do the checkpoint, so it's hard to see how either of the commits
you mention could be involved.
I think what we need to look for is reasons why (1) the postmaster
never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
main loop doesn't get to noticing shutdown_requested.
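For reference, the pattern at issue is the usual latch-driven loop.  Here is
a condensed, hedged sketch of that shape -- not the literal checkpointer.c
code, just enough to show why a swallowed latch set matters; the handler and
flag names follow the conventions in that file, and the sleep length is a
placeholder:

    #include "postgres.h"

    #include "access/xlog.h"        /* ShutdownXLOG() */
    #include "miscadmin.h"          /* MyLatch */
    #include "pgstat.h"             /* wait-event identifiers */
    #include "storage/ipc.h"        /* proc_exit() */
    #include "storage/latch.h"

    static volatile sig_atomic_t shutdown_requested = false;

    /* SIGUSR2 handler: how the postmaster requests the shutdown checkpoint */
    static void
    ReqShutdownHandler(SIGNAL_ARGS)
    {
        shutdown_requested = true;
        SetLatch(MyLatch);
    }

    static void
    checkpointer_loop_sketch(void)
    {
        for (;;)
        {
            /* Clear any pending wakeup before examining the request flags. */
            ResetLatch(MyLatch);

            if (shutdown_requested)
            {
                ShutdownXLOG(0, 0);     /* the shutdown checkpoint */
                proc_exit(0);
            }

            /* ... absorb fsync requests, run a checkpoint if one is due ... */

            /*
             * Sleep until the latch is set again (or the timeout elapses).
             * If anything else in this process reset the latch after the
             * handler above set it, shutdown_requested will not be
             * rechecked until some later wakeup arrives.
             */
            (void) WaitLatch(MyLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             60 * 1000L,    /* placeholder sleep length */
                             WAIT_EVENT_CHECKPOINTER_MAIN);
        }
    }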
A rather scary point for (2) is that said main loop seems to be
assuming that MyLatch a/k/a MyProc->procLatch is not used for any
other purposes in the checkpointer process. If there were something,
like say a condition variable wait, that would reset MyLatch at any
time during a checkpoint, then we could very easily go to sleep at the
bottom of the loop and not notice that there's a pending shutdown request.
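To make that hazard concrete, here is a hypothetical example; the helper,
the condition variable, the prerequisite check, and the wait-event value are
all invented for illustration, I'm not pointing at any particular code in
the tree:

    #include "postgres.h"

    #include "miscadmin.h"                      /* MyLatch */
    #include "storage/condition_variable.h"

    static ConditionVariable some_cv;           /* hypothetical CV */
    static bool prerequisite_satisfied(void);   /* hypothetical check */

    /* hypothetical helper reached from somewhere inside a checkpoint */
    static void
    wait_for_checkpoint_prerequisite(void)
    {
        ConditionVariablePrepareToSleep(&some_cv);
        while (!prerequisite_satisfied())
            ConditionVariableSleep(&some_cv, 0 /* hypothetical wait event */);
        ConditionVariableCancelSleep();

        /*
         * ConditionVariableSleep() waits on MyLatch and resets it.  If the
         * postmaster's SIGUSR2 arrived while we were waiting in here, the
         * checkpointer's handler has already set shutdown_requested and
         * called SetLatch(), but that wakeup has now been consumed.  The
         * outer loop is already past its flag check, so its WaitLatch()
         * at the bottom will sleep until some unrelated wakeup or its
         * timeout, with the shutdown request still pending.
         */
    }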
Now, c6c9474aa did not break this, because the latch resets that
it added happen in other processes not the checkpointer. But I'm
feeling suspicious that some other change we made recently might've
borked it. And in general, it seems like we've managed to load a
lot of potentially conflicting roles onto process latches.
Do we need to think harder about establishing rules for multiplexed
use of the process latch? I'm imagining some rule like "if you are
not the outermost event loop of a process, you do not get to
summarily clear MyLatch. Make sure to leave it set after waiting,
if there was any possibility that it was set by something other than
the specific event you're concerned with".
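One possible shape for such a rule, sketched with invented names (an
illustration of the convention, not proposed code): a non-outermost wait
re-arms the latch before returning whenever it may have consumed a wakeup
that wasn't meant for it.

    #include "postgres.h"

    #include "miscadmin.h"          /* MyLatch */
    #include "storage/latch.h"

    /* hypothetical inner wait, called from inside some larger operation */
    static void
    wait_for_inner_event(volatile sig_atomic_t *flag)
    {
        bool        consumed_wakeup = false;

        while (!*flag)
        {
            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
                             -1L, 0 /* hypothetical wait event */);
            ResetLatch(MyLatch);
            consumed_wakeup = true;
        }

        /*
         * We may have reset a latch set that was aimed at the outer event
         * loop (for instance the shutdown request's SetLatch), so leave
         * the latch set; the outer loop's WaitLatch() will then return
         * promptly and recheck its flags instead of sleeping through the
         * request.
         */
        if (consumed_wakeup)
            SetLatch(MyLatch);
    }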
regards, tom lane