Re: Upgrade 10.5->10.6 : db crash BUS ERROR (sig 10), reproducible

From: Peter <pmc(at)citylink(dot)dinoex(dot)sub(dot)org>
To: pgsql-admin(at)postgresql(dot)org
Cc: pgsql(at)FreeBSD(dot)org
Subject: Re: Upgrade 10.5->10.6 : db crash BUS ERROR (sig 10), reproducible
Date: 2019-03-08 01:20:12
Message-ID: 20190308012012.GA49481@gate.oper.dinoex.org
Lists: pgsql-admin

Hi Tom, Andrew,

Many thanks for the replies! All right, let's fill in some concrete
data:

> I'm assuming from the CC that this is on FreeBSD, but on what
> architecture?

While out on my evening errands I realized that I should have mentioned
this. FreeBSD is correct; it is built on amd64 for i386, and runs on
i386.

Version:
FreeBSD 11.2-RELEASE-p9 #0 r343946M#C51:82
Build-Options:
OPTIONS_FILE_UNSET+=DEBUG
OPTIONS_FILE_UNSET+=DOCS
OPTIONS_FILE_UNSET+=DTRACE
OPTIONS_FILE_SET+=GSSAPI
OPTIONS_FILE_SET+=INTDATE
OPTIONS_FILE_UNSET+=LDAP
OPTIONS_FILE_SET+=NLS
OPTIONS_FILE_UNSET+=OPTIMIZED_CFLAGS
OPTIONS_FILE_UNSET+=PAM
OPTIONS_FILE_SET+=SSL
OPTIONS_FILE_SET+=TZDATA
OPTIONS_FILE_SET+=XML
Extra Compiler-Options:
-march=pentium3
Init-Options:
--data-checksums --encoding=utf-8 --lc-collate=de_DE.UTF-8
--lc-ctype=de_DE.UTF-8 --lc-messages=en_US.UTF-8
--lc-monetary=en_US.UTF-8 --lc-numeric=en_US.UTF-8
--lc-time=en_US.UTF-8
Run-Options:
-w -m fast -o --config_file=/usr/local/etc/postgresql/postgresql.conf

Furthermore, the FreeBSD port introduced a change for release 10.6: it
forces the use of gcc on i386 (gcc-8 in this case). Earlier versions were
built with the system compiler, Clang. The commit log says this about the
matter:

! r484807 | girgen | 2018-11-12 16:54:19 +0100 (Mon, 12 Nov 2018) | 5 lines
!
! Fix build problems on i386
!
! Use GCC seems to be proper way to do it. SSE2 would not be available
! for all CPU:s.

> Did it drop a core file (look in the data dir for postgres.core) and if
> so can you get a backtrace?

Looking... yes, there is a core. Let's grab a first-fault core,
as that one obviously is from the failed recovery:

! (gdb) core postgres.core.1st
! Core was generated by `postgres: bgworker: parallel worker for PID 68755 '.
! Program terminated with signal 10, Bus error.
! Reading symbols from <etc etc>
! #0 0x0838bdf2 in pg_checksum_page ()
! (gdb) bt
! #0 0x0838bdf2 in pg_checksum_page ()
! #1 0x0838a2b8 in PageIsVerified ()
! #2 0x5a824500 in ?? ()
! #3 0x00000000 in ?? ()

The second one looks this way:

! (gdb) core postgres.core
! Core was generated by `postgres: startup process recovering 000000010000002C000000C6'.
! Program terminated with signal 10, Bus error.
! Reading symbols from <lots of files>
! #0 0x0838bdf2 in pg_checksum_page ()
! (gdb) bt
! #0 0x0838bdf2 in pg_checksum_page ()
! #1 0x0838a2b8 in PageIsVerified ()
! #2 0x59e14500 in ?? ()
! #3 0x00000000 in ?? ()
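
Both cores fault at the same place. A minimal sketch for comparing the two
backtraces mechanically, assuming the frame lines are copied verbatim from
the gdb output above (the helper names parse_frames and common_prefix are
made up for this illustration):

```python
import re

# Frame lines exactly as printed by "bt" in the two gdb sessions above.
FIRST_CORE = """\
#0 0x0838bdf2 in pg_checksum_page ()
#1 0x0838a2b8 in PageIsVerified ()
#2 0x5a824500 in ?? ()
#3 0x00000000 in ?? ()
"""
SECOND_CORE = """\
#0 0x0838bdf2 in pg_checksum_page ()
#1 0x0838a2b8 in PageIsVerified ()
#2 0x59e14500 in ?? ()
#3 0x00000000 in ?? ()
"""

# Matches "#<n> <address> in <symbol> ()" as gdb prints it without debug info.
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+) \(\)")


def parse_frames(bt_text):
    """Return a list of (frame number, address, symbol) tuples."""
    frames = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                (int(match.group(1)), int(match.group(2), 16), match.group(3))
            )
    return frames


def common_prefix(frames_a, frames_b):
    """Return the leading frames that agree in number, address and symbol."""
    shared = []
    for frame_a, frame_b in zip(frames_a, frames_b):
        if frame_a != frame_b:
            break
        shared.append(frame_a)
    return shared


shared = common_prefix(parse_frames(FIRST_CORE), parse_frames(SECOND_CORE))
for number, address, symbol in shared:
    print(f"#{number} {address:#010x} {symbol}")
```

The two innermost frames (pg_checksum_page called from PageIsVerified) are
identical in both cores; the frames below them are unresolved, which is to
be expected from a build without debugging symbols.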

Anything more I can do here? (Advice on how to build with debugging
support is appreciated.)

> You can check whether your CPU supports SSE2 by looking at the Features=
> line in /var/run/dmesg.boot. It seems unlikely that it does not, because
> SSE2 was introduced in 2000 with the Pentium 4.

No need to check; I am absolutely certain that it does NOT.
https://www.asus.com/supportonly/CUV4X-DLS/HelpDesk_CPU/
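
For anyone who does want to check, here is a minimal sketch of the
Features= lookup suggested above. The sample line is an illustrative
Pentium III style entry, NOT copied from this machine's dmesg.boot, and the
function name cpu_features is made up for this example:

```python
import re

# Hypothetical sample in the style of /var/run/dmesg.boot on FreeBSD.
# It is NOT taken from the machine discussed in this thread.
SAMPLE_DMESG = """\
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
"""


def cpu_features(dmesg_text):
    """Collect the flag names listed on every "Features=" line."""
    features = set()
    for line in dmesg_text.splitlines():
        # Only the plain "Features=" line is inspected here; lines such as
        # "Features2=" or "AMD Features=" are deliberately ignored.
        match = re.match(r"\s*Features=0x[0-9a-fA-F]+<([^>]*)>", line)
        if match:
            features.update(match.group(1).split(","))
    return features


flags = cpu_features(SAMPLE_DMESG)
print("SSE: ", "SSE" in flags)
print("SSE2:", "SSE2" in flags)
```

On a real system the text would be read from /var/run/dmesg.boot instead of
the sample string. A CPU of this class lists SSE but not SSE2.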

But your explanation does not seem to answer the fundamental question: is
the database at 10.6 still supposed to be able to run without SSE2?

> It seems pretty unlikely that that'd have anything to do with a
> bus-error failure, anyway. But this report contains far too little
> information to let anyone do anything but speculate.

Whatever information you would like to have, just ask and I will gladly do
my best to obtain it as I get around to it. (This is a reproducible crash
in a very well maintained piece of software; this is rather fun.)

Some more experiments & observations:

The crash happens at a specific query: I get parse and bind, but no execute
timing.
Furthermore, when I set

! max_parallel_workers_per_gather = 0

then the query goes through and delivers proper results. But then, after
a few minutes, I get this one:

! postgres[71256]: [8-1] :[] LOG: 00000: checkpointer process (PID 71258)
! was terminated by signal 10: Bus error

Different approach, same result:

! dynamic_shared_memory_type = posix -> crash immediate
! dynamic_shared_memory_type = sysv -> crash immediate
! dynamic_shared_memory_type = mmap -> crash immediate
! dynamic_shared_memory_type = none -> crash later in checkpointer

regards,
PMc
