Re: Crash dumps

From: Radoslaw Smogura <rsmogura(at)softperience(dot)eu>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Crash dumps
Date: 2011-07-04 13:59:54
Message-ID: 20110704140000.88611B5DBD8@mail.postgresql.org
Lists: pgsql-hackers

Information that a backend has crashed should reach the master quickly, so that it can kill the others as fast as possible. What I had in mind was to use socket urgent (out-of-band) data, but this would require spawning a small thread in the master (I think OOB data may not be processed in a safe way).

On the one hand, processing core dumps may be good, but on the other hand they may take up a huge amount of space. In any case, using them will require building PostgreSQL with debugging symbols.

Regards,
Radoslaw Smogura
(mobile)

-----Original Message-----
From: Craig Ringer
Sent: 4 July 2011 13:57
To: Radosław Smogura
Cc: PG Hackers
Subject: Re: [HACKERS] Crash dumps

On 4/07/2011 7:03 PM, Radosław Smogura wrote:

> Actually, what I was thinking about was adding dumping of GUCs,
> etc. The list of mappings came from when I tried to mmap PostgreSQL, and
> due to many errors, which sometimes occurred in unexpected places, I
> needed to add something that would be useful for me and easy to analyse
> (I could simply find the pointer, and then check which region failed).
> The idea to try to evolve this came later.

Why not produce a tool that watches the datadir for core files and
processes them? Most but not all of the info you listed should be able
to be extracted from a core file. Things like GUCs should be extractable
with a bit of gdb scripting - and with much less chance of crashing than
trying to read them from a possibly corrupt heap within a crashing backend.

To capture any information not available from the core, you can enlist
the postmaster's help. It gets notified when a child crashes and should
be able to capture things like the memory and disk state. See void
reaper(SIGNAL_ARGS) in postmaster.c and HandleChildCrash(...) . If
nothing else, the postmaster could probably fork a "child crashed"
helper to collect data, analyse the core file, email the report to the
admin, etc.
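
[A minimal sketch of the point at which a parent learns that a child has
crashed, assuming a plain POSIX environment. This is not the PostgreSQL
reaper() or HandleChildCrash() code; reap_crashed_child() is a hypothetical
name used only for illustration. It shows that the terminating signal only
becomes visible once waitpid() returns, which is where a real reaper could
fork a "child crashed" helper.]

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: fork a child that dies from
 * the given signal, reap it, and return the signal number the parent
 * observed, or -1 if the child did not die from a signal.
 */
int
reap_crashed_child(int signo)
{
    pid_t       pid = fork();
    int         status;

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* Child: restore the default action and die from the signal. */
        signal(signo, SIG_DFL);
        raise(signo);
        _exit(0);               /* not reached if the signal is fatal */
    }

    /* Parent: the exit status only exists once the child is gone. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    if (WIFSIGNALED(status))
        return WTERMSIG(status);    /* a reaper could fork a helper here */

    return -1;
}
```

[Calling reap_crashed_child(SIGSEGV) returns SIGSEGV in the parent. By that
time the child's memory is already released, which is the limitation
discussed next.]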

About the only issue there is that the postmaster relies on the exit
status to trigger the reaper code. Once an exit status is available, the
crashed process is gone, so the free memory will reflect the memory
state after the backend dies, and shared memory's state will have moved
on from how it was when the backend was alive.

For that reason, it'd be handy if a backend could trap SIGSEGV and
reliably tell the postmaster "I'm crashing!" so the postmaster could
fork a helper to capture any additional info the backend needs to be
alive for. Then the helper can gcore() the backend, or the backend can
just clear the SIGSEGV handler and kill(11) itself to keep on crashing
and generate a core.
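
[A rough sketch of the "trap, report, then keep on crashing" idea, assuming
POSIX signals. The names crash_handler() and crash_and_redump_demo() are
hypothetical, the fault is simulated with raise(SIGSEGV) rather than a real
bad memory access, and the "report" is just a write() to stderr standing in
for whatever notification the postmaster would get.]

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical handler: report, then restore the default action and
 * re-raise, so the process still dies from SIGSEGV (and dumps core if
 * core dumps are enabled).
 */
static void
crash_handler(int signo)
{
    static const char msg[] = "backend: crashing\n";

    /* Only async-signal-safe calls: write(), signal(), raise(). */
    (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
    signal(signo, SIG_DFL);
    raise(signo);       /* delivered once this handler returns */
}

/*
 * Fork a stand-in "backend" that installs the handler and then faults.
 * Returns the signal that killed it as seen by the parent, or -1.
 */
int
crash_and_redump_demo(void)
{
    pid_t       pid = fork();
    int         status;

    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = crash_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        raise(SIGSEGV);         /* simulated fault */
        _exit(0);               /* only reached if the re-raise failed */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

[The parent still sees a death by SIGSEGV, so existing crash handling based
on the exit status would keep working. Whether a handler can do anything
useful in a genuinely corrupted process is the open question below.]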

Unfortunately, "reliably" and "segfault" don't go together. You don't
want a crashing backend writing to shared memory, so it can't use shm
to tell the postmaster it's dying. Signals are ... interesting ... at
the best of times, but would probably still be the best bet. The
postmaster could install a SIGUSR[whatever] or RT signal handler that
takes a siginfo so it knows the pid of the signal sender. The crashing
backend could signal the postmaster with an agreed signal to say "I'm
crashing" and let the postmaster clean it up. The problem with this is
that a lost signal (for any reason) would cause a zombie backend to hang
around waiting to be killed by a postmaster that never heard it was
crashing.
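
[A sketch of the siginfo-based notification described above, assuming POSIX
sigaction() with SA_SIGINFO and using SIGUSR1 as the "agreed signal". All
function and variable names are hypothetical, and storing a pid in a
sig_atomic_t assumes pid_t fits in it, which is typical but not guaranteed.
The parent stands in for the postmaster and the child for the backend.]

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Pid of the process that said "I'm crashing"; 0 means none yet. */
static volatile sig_atomic_t crashing_pid = 0;

static void
on_backend_crashing(int signo, siginfo_t *info, void *ucontext)
{
    (void) signo;
    (void) ucontext;
    crashing_pid = info->si_pid;    /* sender's pid, filled in by the kernel */
}

/* Returns 1 if the parent identified the signalling child, 0 or -1 if not. */
int
notify_postmaster_demo(void)
{
    struct sigaction sa;
    sigset_t    block, old, waitmask;
    pid_t       child;
    int         ok;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_backend_crashing;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Block SIGUSR1 so it can only be delivered inside sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old);
    waitmask = old;
    sigdelset(&waitmask, SIGUSR1);

    crashing_pid = 0;
    child = fork();
    if (child < 0)
    {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (child == 0)
    {
        /* "Backend": announce the crash, then wait to be cleaned up. */
        kill(getppid(), SIGUSR1);
        for (;;)
            pause();
    }

    /* "Postmaster": a lost signal would make this wait forever. */
    while (crashing_pid == 0)
        sigsuspend(&waitmask);

    ok = (crashing_pid == child);
    kill(child, SIGKILL);           /* clean up the waiting backend */
    waitpid(child, NULL, 0);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return ok;
}
```

[The child loop that waits to be killed is exactly the hanging backend
mentioned above if the notification never arrives, so a real design would
presumably need a timeout on the backend side.]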

BTW, the win32 crash dump handler would benefit from being able to use
some of the same facilities. In particular, being able to tell the
postmaster "Argh, ogod I'm crashing, fork something to dump my core!"
rather than trying to self-dump would be great. It'd also allow the
addition of extra info like GUC data, last few lines of logs etc to the
minidump, something that the win32 crash dump handler cannot currently
do safely.

--
Craig Ringer
