From: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: checkpointer code behaving strangely on postmaster -T |
Date: | 2012-05-10 15:04:54 |
Message-ID: | 1336659379-sup-7447@alvh.no-ip.org |
Lists: | pgsql-hackers |
Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> > I noticed while doing some tests that the checkpointer process does not
> > recover very nicely after a backend crashes under postmaster -T (after
> > all processes have been kill -CONTd, of course, and postmaster told to
> > shutdown via Ctrl-C on its console). For some reason it seems to get
> > stuck on a loop doing sleep(0.5s). In another case I caught it trying to
> > do a checkpoint, but it was progressing a single page each time and then
> > sleeping. In that condition, the checkpoint took a very long time to
> > finish.
>
> Is this still a problem as of HEAD? I think I've fixed some issues in
> the checkpointer's outer loop logic, but not sure if what you saw is
> still there.
Yep, it's still there as far as I can tell. A backtrace from the
checkpointer shows it's waiting on the latch.
It seems to me that the bug is in the postmaster state machine rather
than checkpointer itself. After a few false starts, this seems to fix
it:
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
                     signal_child(WalWriterPID, SIGTERM);
                 if (BgWriterPID != 0)
                     signal_child(BgWriterPID, SIGTERM);
+                if (FatalError && CheckpointerPID != 0)
+                    signal_child(CheckpointerPID, SIGUSR2);
                 /*
                  * If we're in recovery, we can't kill the startup process
@@ -2178,6 +2180,8 @@ pmdie(SIGNAL_ARGS)
                     signal_child(WalReceiverPID, SIGTERM);
                 if (BgWriterPID != 0)
                     signal_child(BgWriterPID, SIGTERM);
+                if (FatalError && CheckpointerPID != 0)
+                    signal_child(CheckpointerPID, SIGUSR2);
                 if (pmState == PM_RECOVERY)
                 {
                     /* only checkpointer is active in this state */
Note that since checkpointer can only be running after we enter
FatalError when the -T (send SIGSTOP instead of SIGQUIT) switch is used,
this bug doesn't seem to affect normal usage. (I'm not sure SIGUSR2 is
the most appropriate signal to send at this time -- since we're in
FatalError, probably SIGQUIT is better suited.)
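The choice between the two signals can be sketched as a small stand-alone C model. This is not postmaster code: PmModel, checkpointer_shutdown_signal and the prefer_quit flag are hypothetical names used only for illustration, and only the "FatalError && CheckpointerPID != 0" guard is taken from the patch above. The comments assume the usual checkpointer meaning of the signals (SIGUSR2 requests a normal shutdown, SIGQUIT an immediate exit).

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the postmaster state the patch looks at. */
typedef struct
{
    bool  fatal_error;       /* models FatalError */
    pid_t checkpointer_pid;  /* models CheckpointerPID; 0 means not running */
} PmModel;

/*
 * Return the signal a shutdown request would send to the checkpointer,
 * or 0 if no signal would be sent.
 *
 * prefer_quit is an assumption added for illustration: false picks
 * SIGUSR2 (what the patch sends), true picks SIGQUIT (the alternative
 * raised in the message).
 */
static int
checkpointer_shutdown_signal(const PmModel *pm, bool prefer_quit)
{
    assert(pm != NULL);

    /* Same guard as the patch: only act after a crash, and only if
     * a checkpointer process is still around. */
    if (!pm->fatal_error || pm->checkpointer_pid == 0)
        return 0;

    return prefer_quit ? SIGQUIT : SIGUSR2;
}
```

In this model a caller would pass the returned value to something like kill() when it is nonzero; in the real postmaster that role is played by signal_child(), as in the patch.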
One good thing is that when I patched postmaster in a different way
(which I later realized to be bogus), I caused it to die with an
assertion while checkpointer was still running; the debug output let me
know that checkpointer went away immediately.
--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support