From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Another reason why the recovery tests take a long time |
Date: | 2017-06-26 16:32:00 |
Message-ID: | 21344.1498494720@sss.pgh.pa.us |
Lists: | pgsql-hackers |
I've found another edge-case bug through investigation of unexpectedly
slow recovery test runs. It goes like this:
* While streaming from master to slave, test script shuts down master
while slave is left running. We soon restart the master, but meanwhile:
* slave's walreceiver process fails, reporting:
  2017-06-26 16:06:50.209 UTC [13209] LOG: replication terminated by primary server
  2017-06-26 16:06:50.209 UTC [13209] DETAIL: End of WAL reached on timeline 1 at 0/3000098.
  2017-06-26 16:06:50.209 UTC [13209] FATAL: could not send end-of-streaming message to primary: no COPY in progress
* slave's startup process observes that walreceiver is gone and sends
PMSIGNAL_START_WALRECEIVER to ask for a new one
* more often than you would guess, in fact nearly 100% reproducibly for
me, the postmaster receives/services the PMSIGNAL before it receives
SIGCHLD for the walreceiver. In this situation sigusr1_handler just
throws away the walreceiver start request, reasoning that the walreceiver
is already running.
* eventually, it dawns on the startup process that the walreceiver
isn't starting, and it asks for a new one. But that takes ten seconds
(WALRCV_STARTUP_TIMEOUT).
So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver. But I wonder whether it's intentional that the old
walreceiver dies in the first place. That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.
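To make the ordering problem and the proposed flag concrete, here is a toy model in C. It is only a sketch and not postmaster code: ToyPostmaster, on_start_request, on_child_exit, maybe_start_walreceiver, and replay_race are all made-up names, and real signal handling is reduced to two function calls made in the problematic order (start request first, child-exit notice second).

```c
/* Toy model only; every name below is hypothetical, not PostgreSQL source. */
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool walreceiver_running;   /* postmaster's belief; stale until the exit is noticed */
    bool walreceiver_requested; /* proposed flag: a start request is still pending */
    int  starts;                /* how many new walreceivers were launched */
} ToyPostmaster;

/* Launch a walreceiver if one is wanted and none is believed to be running. */
static void
maybe_start_walreceiver(ToyPostmaster *pm)
{
    if (pm->walreceiver_requested && !pm->walreceiver_running)
    {
        pm->walreceiver_running = true;
        pm->walreceiver_requested = false;  /* cleared only on an actual start */
        pm->starts++;
    }
}

/* Stand-in for servicing the start request from the startup process. */
static void
on_start_request(ToyPostmaster *pm, bool use_flag)
{
    if (use_flag)
    {
        pm->walreceiver_requested = true;   /* remember the request */
        maybe_start_walreceiver(pm);
    }
    else if (!pm->walreceiver_running)
    {
        pm->walreceiver_running = true;
        pm->starts++;
    }
    /* else: request dropped because a walreceiver is believed to be running */
}

/* Stand-in for noticing that the old walreceiver has exited. */
static void
on_child_exit(ToyPostmaster *pm, bool use_flag)
{
    assert(pm->walreceiver_running);        /* we can only reap what we thought was alive */
    pm->walreceiver_running = false;
    if (use_flag)
        maybe_start_walreceiver(pm);        /* honor any pending request now */
}

/*
 * Replay the bad ordering: the start request is serviced before the
 * child-exit notice.  Returns the number of walreceivers launched.
 */
static int
replay_race(bool use_flag)
{
    ToyPostmaster pm = {
        .walreceiver_running = true,        /* old walreceiver has died, but we don't know yet */
        .walreceiver_requested = false,
        .starts = 0
    };

    on_start_request(&pm, use_flag);
    on_child_exit(&pm, use_flag);
    return pm.starts;
}
```

In this model replay_race(false) launches nothing, which corresponds to the startup process waiting out WALRCV_STARTUP_TIMEOUT before asking again, while replay_race(true) launches one walreceiver as soon as the exit is noticed.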
regards, tom lane