From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: What is happening on buildfarm member crake?
Date: 2014-01-20 01:22:42
Message-ID: CA+TgmoZBXXVMF-oH=JEv5BsshE7P8PTz25-1qjVWeMh8LbhRHg@mail.gmail.com
Lists: pgsql-hackers
On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> Also crake does produce backtraces on core dumps, and they are at the
> bottom of the buildfarm log. The latest failure backtrace is reproduced
> below.
>
> ================== stack trace:
> /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
> [New LWP 12584]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: buildfarm
> contrib_regression_test_shm_mq'.
> Program terminated with signal 11, Segmentation fault.
> #0 SetLatch (latch=0x1c) at pg_latch.c:509
> 509 if (latch->is_set)
> #0 SetLatch (latch=0x1c) at pg_latch.c:509
> #1 0x000000000064c04e in procsignal_sigusr1_handler
> (postgres_signal_arg=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
> #2 <signal handler called>
> #3 _dl_fini () at dl-fini.c:190
> #4 0x000000361ba39931 in __run_exit_handlers (status=0,
> listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
> #5 0x000000361ba399b5 in __GI_exit (status=<optimized out>) at
> exit.c:100
> #6 0x00000000006485a6 in proc_exit (code=0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
> #7 0x0000000000663abb in PostgresMain (argc=<optimized out>,
> argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq",
> username=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
> #8 0x000000000062220f in BackendRun (port=0x12d6bf0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
> #9 BackendStartup (port=0x12d6bf0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
> #10 ServerLoop () at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
> #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
> #12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205
Hmm, that looks an awful lot like the SIGUSR1 signal handler is
getting called after we've already completed shmem_exit. And indeed
that seems like the sort of thing that would result in dying horribly
in just this way. The obvious fix seems to be to check
proc_exit_inprogress before doing anything that might touch shared
memory, but there are a lot of other SIGUSR1 handlers that don't do
that either. However, in those cases, the likely cause of a SIGUSR1
would be a sinval catchup interrupt or a recovery conflict, which
aren't likely to be so far delayed that they arrive after we've
already disconnected from shared memory. But the dynamic background
workers stuff adds a new possible cause of SIGUSR1: the postmaster
letting us know that a child has started or died. And that could
happen even after we've detached from shared memory.
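
For illustration only, here is a minimal standalone sketch of that kind of guard. The names proc_exit_inprogress and sketch_sigusr1_handler merely stand in for the real backend global and for procsignal_sigusr1_handler; the real handler does considerably more than this, and the real flag is declared in ipc.c:

#include <errno.h>
#include <signal.h>

/*
 * Stand-in for the backend's proc_exit_inprogress flag; simplified
 * declaration for the purposes of this sketch.
 */
static volatile sig_atomic_t proc_exit_inprogress = 0;

static void
sketch_sigusr1_handler(int signo)
{
    int save_errno = errno;

    /*
     * If proc_exit() has already begun, shared memory (including the
     * process latch the handler would normally set) may already be
     * detached, so return without touching it.
     */
    if (proc_exit_inprogress)
    {
        errno = save_errno;
        return;
    }

    /*
     * ... normal handler work would go here: check which ProcSignalReason
     * flags are set and set the process latch while shmem is still
     * attached ...
     */

    errno = save_errno;
}

int
main(void)
{
    signal(SIGUSR1, sketch_sigusr1_handler);
    proc_exit_inprogress = 1;   /* pretend proc_exit() has started */
    raise(SIGUSR1);             /* handler now returns immediately */
    return 0;
}

The backtrace above is consistent with exactly that pattern: the segfault is inside SetLatch(), reached from procsignal_sigusr1_handler(), on an obviously bogus latch pointer (0x1c).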
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company