From: | Dan Moschuk <dan(at)freebsd(dot)org> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Dan Moschuk <dan(at)freebsd(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Core dump |
Date: | 2000-10-12 20:47:53 |
Message-ID: | 20001012164752.A3004@spirit.jaded.net |
Lists: | pgsql-hackers |
| > Sparc solaris 2.7 with postgres 7.0.2
| > It seems to be reproducible; the server crashes on us about once every
| > few hours.
|
| That's a very bizarre backtrace. Why the multiple levels of recursive
| entry to the quickdie() signal handler? I wonder if you aren't looking
| at some kind of Solaris bug --- perhaps it's not able to cope with a
| signal handler turning around and issuing new kernel calls.
I'm not sure that is the issue; see below.
| The core file you are looking at is probably *not* from the original
| failure, whatever that is. The sequence is probably
|
| 1. Some backend crashes for unknown reason, dumping core.
|
| 2. Postmaster observes messy death of a child, decides that mass suicide
| followed by restart is called for. Postmaster sends SIGUSR1 to all
| remaining backends to make them commit hara-kiri.
|
| 3. One or more other backends crash trying to obey postmaster's command.
| The corefile left for you to examine comes from whichever crashed
| last.
|
| So there are at least two problems here, but we only have evidence of
| the second one.
|
| Since the problem is fairly reproducible, I'd suggest you temporarily
| dike out the elog(NOTICE) call in quickdie() (in
| src/backend/tcop/postgres.c), which will probably allow the backends
| to honor SIGUSR1 without dumping core. Then you have a shot at seeing
| the core from the original failure.
I will try this; however, the database is currently running under light load.
Only under high load does postgres start to choke and eventually die.
| Assuming that this works (ie, you find a core that's not got anything
| to do with quickdie()), I'd suggest an inquiry to Sun about whether
| their signal handler logic hasn't got a problem with write() issued
| from inside a signal handler. Meanwhile let us know what the new
| backtrace shows.
I wrote a quick test program to test this theory. Below is the code and the
output.
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

static void moo (int);

int
main (void)
{
    signal(SIGUSR1, moo);   /* install the handler */
    raise(SIGUSR1);         /* deliver the signal to ourselves */
    return 0;
}

/* Signal handler: issues write() directly, between two stdio calls. */
static void
moo (int cow)
{
    printf("Getting ready for write()\n");
    write(STDOUT_FILENO, "Hello!\n", 7);
    printf("Done.\n");
}
eclipse% ./x
Getting ready for write()
Hello!
Done.
eclipse%
It would appear from that very rough test program that Solaris doesn't mind
system calls from within a signal handler.
--
Man is a rational animal who always loses his temper when he is called
upon to act in accordance with the dictates of reason.
-- Oscar Wilde