Re: PosgreSQL backend process crashed with signal 9

From: Pavel Suderevsky <psuderevsky(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Aleksey Romanov <drednout(dot)by(at)gmail(dot)com>
Subject: Re: PosgreSQL backend process crashed with signal 9
Date: 2016-04-06 14:36:31
Message-ID: CAEBTBzuuSR7X70sK_iVmVLg-yhOAABG3Yrbdj+jhN=-yvx9eYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I've faced similar behaviour in much more simple situation. It crashed on
simple nested loop. And all other child postmaster processes were
terminated likewise.

Repoduced on:
1. 9.5.1 + Ubuntu 14.04.4 LTS
2. 9.5.1 + CentOS 7.2.1511
2. 9.5.2 + CentOS 7.2.1511

DDL:

> platform_db=# \d node_statuses
> Table "platform_db.node_statuses"
> Column | Type | Modifiers
>
>
> -----------------------+-------------------------+---------------------------------
> node_id | integer | not null
> status_id | integer | not null
> msg | character varying(1024) | default NULL::character
> varying
> node_status_timestamp | bigint | not null
> Indexes:
> "NODE_STATUSES_NODE_ID_FK" btree (node_id)
> "NODE_STATUSES_NODE_ID_TIMESTAMP_IDX" btree (node_id,
> node_status_timestamp)
> "NODE_STATUSES_TIMESTAMP_IDX" btree (node_status_timestamp)
>
> platform_db=# \d node_status_values
> Table "platform_db.node_status_values"
> Column | Type | Modifiers
> ----------------------+---------+-----------
> node_id | integer | not null
> status_id | integer | not null
> node_value_timestamp | bigint | not null
> Indexes:
> "NODE_STATUS_VALUES_NODE_ID_FK" btree (node_id)
> "NODE_STATUS_VALUES_NODE_ID_TIMESTAMP_IDX" btree (node_id,
> node_value_timestamp)

SQL:

> SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON
> ns.node_id = nsv.node_id where ns.node_id = 1147;

Postgres log:

> 2016-04-05 18:37:31 MSK LOG: server process (PID 2848) was terminated by
> signal 9: Killed
> 2016-04-05 18:37:31 MSK DETAIL: Failed process was running: COMMIT
> 2016-04-05 18:37:32 MSK LOG: terminating any other active server
> processes
> 2016-04-05 18:37:32 MSK WARNING: terminating connection because of crash
> of another server process
> 2016-04-05 18:37:32 MSK DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possib
> ly corrupted shared memory.
> 2016-04-05 18:37:32 MSK HINT: In a moment you should be able to
> reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db(at)platform_db [unknown] %
> WARNING: terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db(at)platform_db [unknown] %
> DETAIL: The postmaster has commanded this server process to roll back the
> current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK 127.0.0.1 platform_db(at)platform_db [unknown] %
> HINT: In a moment you should be able to reconnect to the database and
> repeat your command.
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % WARNING:
> terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % DETAIL: The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % HINT: In a moment
> you should be able to reconnect to the database and repeat your command.
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % WARNING:
> terminating connection because of crash of another server process
> 2016-04-05 18:37:32 MSK ::1 zabbix(at)zabbix [unknown] % DETAIL: The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> 2016-04-05 18:37:37 MSK LOG: statistics collector process (PID 1012) was
> terminated by signal 9: Killed
> 2016-04-05 18:37:37 MSK LOG: all server processes terminated;
> reinitializing
> 2016-04-05 18:37:39 MSK LOG: database system was interrupted; last known
> up at 2016-04-05 18:31:44 MSK
> 2016-04-05 18:37:39 MSK ::1 zabbix(at)zabbix [unknown] % FATAL: the
> database system is in recovery mode
> 2016-04-05 18:37:41 MSK LOG: database system was not properly shut down;
> automatic recovery in progress
> 2016-04-05 18:37:41 MSK LOG: redo starts at 298/AB146828
> 2016-04-05 18:37:41 MSK ::1 zabbix(at)zabbix [unknown] % FATAL: the
> database system is in recovery mode
> 2016-04-05 18:37:44 MSK [local] postgres(at)platform_db [unknown] % FATAL:
> the database system is in recovery mode
> 2016-04-05 18:37:46 MSK LOG: invalid record length at 298/C07ACA78
> 2016-04-05 18:37:46 MSK LOG: redo done at 298/C07ACA50
> 2016-04-05 18:37:46 MSK LOG: last completed transaction was at log time
> 2016-04-05 18:37:28.455638+03
> 2016-04-05 18:37:48 MSK ::1 zabbix(at)zabbix [unknown] % FATAL: the
> database system is in recovery mode
> 2016-04-05 18:38:08 MSK LOG: MultiXact member wraparound protections are
> now enabled
> 2016-04-05 18:38:08 MSK LOG: autovacuum launcher started
> 2016-04-05 18:38:08 MSK LOG: database system is ready to accept
> connections

In CentOS just:

> 2016-04-06 13:58:58 GMT [local] postgres(at)platform_db psql % LOG: could
> not send data to client: Broken pipe
> 2016-04-06 13:58:58 GMT [local] postgres(at)platform_db psql % STATEMENT:
> SELECT * FROM node_statuses ns INNER JOIN node_status_values nsv ON
> ns.node_id = nsv.node_id where ns.node_id = 1147;
> 2016-04-06 13:58:58 GMT [local] postgres(at)platform_db psql % FATAL:
> connection to client lost

System log [ubuntu]:

> Apr 5 18:37:30 db_server kernel: [3905852.822122] postgres invoked
> oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> Apr 5 18:37:30 db_server kernel: [3905852.822131] postgres cpuset=/
> mems_allowed=0-1
> Apr 5 18:37:30 db_server kernel: [3905852.822136] CPU: 3 PID: 9191 Comm:
> postgres Not tainted 3.13.0-77-generic #121-Ubuntu
> Apr 5 18:37:30 db_server kernel: [3905852.822139] Hardware name:
> Supermicro H8DMR-82/H8DMR-82, BIOS 080014 12/15/2006
> ...
> Apr 5 18:37:31 db_server kernel: [3905854.860083] Out of memory: Kill
> process 2847 (postgres) score 391 or sacrifice child
> Apr 5 18:37:31 db_server kernel: [3905854.860201] Killed process 2847
> (postgres) total-vm:6613344kB, anon-rss:34224kB, file-rss:6387036kB

Counts:

> platform_db=# SELECT count(*) FROM node_statuses ns where node_id = 1147;
> count
> -------
> 6819
> (1 row)
> platform_db=# SELECT count(*) FROM node_status_values nsv where node_id =
> 1147;
> count
> -------
> 6814
> (1 row)
> platform_db=# SELECT count(*) FROM node_statuses ns INNER JOIN
> node_status_values nsv ON ns.node_id = nsv.node_id where ns.node_id = 1147;
>
> count
> ----------
> 46423776
> (1 row)

Query plan:

> Nested Loop (cost=477.14..570040.76 rows=45261752 width=53) (actual
> time=3.223..252614.947 rows=46382904 loops=1)
> -> Bitmap Heap Scan on node_statuses ns (cost=241.60..2336.38
> rows=6862 width=37) (actual time=1.724..21.248 rows=6813 loops=1)
> Recheck Cond: (node_id = 1147)
> Heap Blocks: exact=1397
> -> Bitmap Index Scan on "NODE_STATUSES_NODE_ID_FK"
> (cost=0.00..239.89 rows=6862 width=0) (actual time=1.472..1.472 rows=6813
> loops=1)
> Index Cond: (node_id = 1147)
> -> Materialize (cost=235.54..1948.97 rows=6596 width=16) (actual
> time=0.002..12.092 rows=6808 loops=6813)
> -> Bitmap Heap Scan on node_status_values nsv
> (cost=235.54..1915.99 rows=6596 width=16) (actual time=1.471..17.282
> rows=6808 loops=1)
> Recheck Cond: (node_id = 1147)
> Heap Blocks: exact=1357
> -> Bitmap Index Scan on "NODE_STATUS_VALUES_NODE_ID_FK"
> (cost=0.00..233.89 rows=6596 width=0) (actual time=1.238..1.238 rows=6808
> loops=1)
> Index Cond: (node_id = 1147)
> Planning time: 0.210 ms
> Execution time: 334897.948 ms

Memory on the host:

> total used free shared buffers cached
> Mem: 15G 15G 453M 2.7G 195M 12G
> -/+ buffers/cache: 2.2G 13G
> Swap: 0B 0B 0B

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2016-04-06 14:46:53 Re: PosgreSQL backend process crashed with signal 9
Previous Message Schuch, Mathias (Mathias) 2016-04-06 09:07:05 Re: BUG #14050: "could not reserve shared memory region" in postgresql log