Re: BUG #13459: Replaying WAL logs can hang on startup

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: chris+postgresql(at)qwirx(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #13459: Replaying WAL logs can hang on startup
Date: 2015-06-22 22:25:03
Message-ID: 6574.1435011903@sss.pgh.pa.us
Lists: pgsql-bugs

chris+postgresql(at)qwirx(dot)com writes:
> I restored a standby from a pg_basebackup made from the master the previous
> morning. As a result, it had a lot of WAL logs to catch up on.
> At one point it hangs while restoring logs. It normally takes a few seconds
> to process a 16 MB WAL segment, but on this one, it was "recovering
> 0000000100000CEC00000025" for 7 minutes now with no log output at all.

Hmm ...

> My guess is that ForwardFsyncRequest() is continually returning false, and
> this code is stuck forever. I noticed that it says that "I'm inclined to
> assume that the checkpointer will always empty the queue soon", but there
> is no checkpointer running during recovery, is there?

There is supposed to be one once we have reached a consistent state; see
SetForwardFsyncRequests(). AFAICS it should be impossible to reach the
wait you're seeing unless the startup process's local pendingOpsTable has
been removed by SetForwardFsyncRequests(), and the caller of that should
have pinged the postmaster to start up a checkpointer.
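
To make the failure mode concrete, here is a toy model of the situation being
discussed: a caller retries a forward into a bounded queue, and the retry only
ever succeeds if some other process drains that queue. This is a sketch, not
PostgreSQL source. Every name in it (FsyncQueueModel, model_forward_request,
model_checkpointer_tick, model_retry_forward, QUEUE_CAPACITY) is hypothetical,
and the retry bound exists only so the model terminates; the assumption is
that the real wait has no such bound.

```c
#include <stdbool.h>

/* Hypothetical capacity, chosen only for illustration. */
#define QUEUE_CAPACITY 4

/* Toy stand-in for the shared request queue and its consumer. */
typedef struct
{
    int  pending;               /* requests currently sitting in the queue */
    bool checkpointer_running;  /* is any process draining the queue? */
} FsyncQueueModel;

/* Models a forward attempt: fails (returns false) when the queue is full. */
static bool
model_forward_request(FsyncQueueModel *q)
{
    if (q->pending >= QUEUE_CAPACITY)
        return false;
    q->pending++;
    return true;
}

/* Models the consumer absorbing requests; does nothing if it is absent. */
static void
model_checkpointer_tick(FsyncQueueModel *q)
{
    if (q->checkpointer_running)
        q->pending = 0;
}

/*
 * Models the retry loop.  Returns the number of attempts needed, or -1 if
 * max_tries attempts all failed.  The bound is an artifact of the model.
 */
static int
model_retry_forward(FsyncQueueModel *q, int max_tries)
{
    for (int tries = 1; tries <= max_tries; tries++)
    {
        if (model_forward_request(q))
            return tries;
        /* Stands in for sleeping while another process gets to run. */
        model_checkpointer_tick(q);
    }
    return -1;
}
```

Starting from a full queue, model_retry_forward() succeeds on the second
attempt when checkpointer_running is true, and exhausts any bound you give it
when it is false. The second case matches a wait that never ends, which is why
the presence or absence of a checkpointer is the first thing to establish.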

If you can repro this easily, please look to see whether there's a
checkpointer, and if not, what state the postmaster is in (pmState,
CheckpointerPID, Shutdown, FatalError, RecoveryError might be
interesting). If there is a checkpointer, then that's what to be
looking at.

regards, tom lane
