From: | chris+postgresql(at)qwirx(dot)com |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | BUG #13459: Replaying WAL logs can hang on startup |
Date: | 2015-06-22 08:42:01 |
Message-ID: | 20150622084201.3874.74819@wrigleys.postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 13459
Logged by: Chris Wilson
Email address: chris+postgresql(at)qwirx(dot)com
PostgreSQL version: 9.4.1
Operating system: Linux 2.6.32-220.30.1.el6.x86_64
Description:
Dear developers,
I restored a standby from a pg_basebackup made from the master the previous
morning. As a result, it had a lot of WAL logs to catch up on.
At one point it hangs while restoring logs. It normally takes a few seconds
to process a 16 MB WAL segment, but on this one, it was "recovering
0000000100000CEC00000025" for 7 minutes now with no log output at all.
I restarted it at least once and it reached the same point and hung again. I
attached gdb and saw that it was hung here:
(gdb) bt
#0 0x0000003f358e1433 in __select_nocancel () from /lib64/libc.so.6
#1 0x00000000008be748 in pg_usleep ()
#2 0x000000000074ef9f in ForgetRelationFsyncRequests ()
#3 0x000000000074d0b0 in mdunlink ()
#4 0x000000000074faa5 in smgrdounlink ()
#5 0x00000000004d5b3a in xact_redo_commit_internal ()
#6 0x00000000004d5c36 in xact_redo_commit ()
#7 0x00000000004d5f37 in xact_redo ()
#8 0x00000000004eb36a in StartupXLOG ()
#9 0x00000000006eac29 in StartupProcessMain ()
#10 0x00000000004f9eeb in AuxiliaryProcessMain ()
#11 0x00000000006e9ea1 in StartChildProcess ()
#12 0x00000000006e50ab in PostmasterMain ()
#13 0x000000000064e693 in main ()
Browsing the code, I saw this interesting comment in
ForgetRelationFsyncRequests:
void
ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
{
if (pendingOpsTable)
{
/* standalone backend or startup process: fsync state is
local */
RememberFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC);
}
else if (IsUnderPostmaster)
{
/*
* Notify the checkpointer about it. If we fail to queue
the cancel
* message, we have to sleep and try again ... ugly, but
hopefully
* won't happen often.
*
* XXX should we CHECK_FOR_INTERRUPTS in this loop?
Escaping with an
* error would leave the no-longer-used file still present
on disk,
* which would be bad, so I'm inclined to assume that the
checkpointer
* will always empty the queue soon.
*/
while (!ForwardFsyncRequest(rnode, forknum,
FORGET_RELATION_FSYNC))
pg_usleep(10000L); /* 10 msec seems a good
number */
/*
* Note we don't wait for the checkpointer to actually
absorb the
* cancel message; see mdsync() for the implications.
*/
}
}
My guess is that ForwardFsyncRequest() is continually returning false, and
this code is stuck forever. I noticed that it says that "I'm inclined to
assume that the checkpointer
will always empty the queue soon", but there is no checkpointer running
during recovery, is there?
Although I managed to recover the system by restoring from a different
backup the next day, bypassing this logfile, I think I still have the copy
that hangs on this logfile so I should be able to conduct further
investigations if required. Please let me know.
Cheers, Chris.
From | Date | Subject | |
---|---|---|---|
Next Message | 1809563951 | 2015-06-22 13:15:52 | BUG #13460: ERROR: could not find block containing chunk 0x2930072 |
Previous Message | gregburek | 2015-06-22 04:33:37 | BUG #13458: postgres_fdw with usermapping dumped with pg_dump --no-owners results in dump file with implicit own |