From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Dan Moschuk <dan(at)freebsd(dot)org> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Core dump |
Date: | 2000-10-12 20:10:55 |
Message-ID: | 27214.971381455@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Dan Moschuk <dan(at)freebsd(dot)org> writes:
> Sparc solaris 2.7 with postgres 7.0.2
> It seems to be reproducible; the server crashes on us about once every
> few hours.
That's a very bizarre backtrace. Why the multiple levels of recursive
entry to the quickdie() signal handler? I wonder if you aren't looking
at some kind of Solaris bug --- perhaps it's not able to cope with a
signal handler turning around and issuing new kernel calls.
The core file you are looking at is probably *not* from the original
failure, whatever that is. The sequence is probably:
1. Some backend crashes for unknown reason, dumping core.
2. Postmaster observes messy death of a child, decides that mass suicide
followed by restart is called for. Postmaster sends SIGUSR1 to all
remaining backends to make them commit hara-kiri.
3. One or more other backends crash trying to obey postmaster's command.
The corefile left for you to examine comes from whichever crashed
last.
So there are at least two problems here, but we only have evidence of
the second one.
Since the problem is fairly reproducible, I'd suggest you temporarily
dike out the elog(NOTICE) call in quickdie() (in
src/backend/tcop/postgres.c), which will probably allow the backends
to honor SIGUSR1 without dumping core. Then you have a shot at seeing
the core from the original failure.
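As a rough illustration of what a handler with the message call removed boils down to, here is a standalone sketch. It is not the PostgreSQL source: quiet_quickdie() and run_quiet_quickdie_demo() are hypothetical names, and the fork/kill/waitpid harness only stands in for the postmaster signalling a backend. The handler does nothing except call _exit(), which is async-signal-safe, so an exit code of 1 (rather than death by signal) suggests the handler ran cleanly.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for quickdie() with the message call removed. */
static void quiet_quickdie(int signo)
{
    (void) signo;
    _exit(1);               /* async-signal-safe; skips atexit handlers */
}

/*
 * Fork a child that waits for SIGUSR1, signal it, and report how it ended.
 * Returns the child's exit code, or -1 if the child was killed by a signal
 * (for example, a crash) or if setup failed.
 */
int run_quiet_quickdie_demo(void)
{
    sigset_t block, old;
    int status;
    pid_t pid;

    /* Block SIGUSR1 before forking so the child cannot receive it
     * before its handler is installed. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa;
        sigset_t none;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = quiet_quickdie;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGUSR1, &sa, NULL) != 0)
            _exit(2);

        /* Unblock everything and sleep until a signal arrives. */
        sigemptyset(&none);
        for (;;)
            sigsuspend(&none);
    }

    /* Parent: restore the mask, then play the postmaster's role. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    if (kill(pid, SIGUSR1) != 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_quiet_quickdie_demo() from a small main() and printing the result is enough to try it; a return of -1 would mean the child died from a signal instead of exiting through the handler.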
Assuming that this works (i.e., you find a core that's not got anything
to do with quickdie()), I'd suggest an inquiry to Sun about whether
their signal handler logic hasn't got a problem with write() issued
from inside a signal handler. Meanwhile let us know what the new
backtrace shows.
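A small standalone test along these lines might help when checking whether write() from inside a signal handler behaves on a given platform. This is only a sketch under stated assumptions: writing_handler() and write_from_handler_works() are hypothetical names, and a pipe stands in for the client connection the backend would actually write to. POSIX lists write() as async-signal-safe, so a conforming system should pass.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int notify_fd = -1;                        /* write end of the pipe */
static volatile sig_atomic_t bytes_written = -1;  /* result seen by handler */

/* Handler that issues a kernel call, as the message-sending handler would. */
static void writing_handler(int signo)
{
    static const char msg[] = "handler\n";

    (void) signo;
    bytes_written = (sig_atomic_t) write(notify_fd, msg, sizeof msg - 1);
}

/* Returns 1 if a write() made inside the handler arrived intact, else 0. */
int write_from_handler_works(void)
{
    int fds[2];
    char buf[16];
    struct sigaction sa, old;
    ssize_t n = -1;

    if (pipe(fds) != 0)
        return 0;
    notify_fd = fds[1];

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = writing_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) != 0) {
        close(fds[0]);
        close(fds[1]);
        return 0;
    }

    bytes_written = -1;
    raise(SIGUSR1);     /* in a single-threaded program the handler runs
                         * before raise() returns */

    /* Only read if the handler reported a full write, so that read()
     * cannot block on an empty pipe. */
    if (bytes_written == 8)
        n = read(fds[0], buf, sizeof buf);

    sigaction(SIGUSR1, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return n == 8 && memcmp(buf, "handler\n", 8) == 0;
}
```

If this returns 0, or the process dumps core inside the handler, that would be a compact reproduction to include in a report to the vendor.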
regards, tom lane