rare crash - FailedAssertion snapbuild.c Line: 580

From: Erik Rijkers <er(at)xs4all(dot)nl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: rare crash - FailedAssertion snapbuild.c Line: 580
Date: 2018-08-29 15:43:17
Message-ID: 4e588d5a81a61c799353e1adf5a126bb@xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

To test postgres 11, I still regularly run series of short sessions of
pgbench-over-logical-replication (basically the same thing that I used
last year [1] - now in a perl incarnation). Most of the time the
replication is stable and finishes correctly but sometimes (rarely) I
get:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)

This will probably be difficult to reproduce and to act upon but I
wanted to report it anyway as in the course of the last few months I
have seen it several times, on several machines. Always rarely, always
postgres 11 (I did not try other versions).

Erik Rijkers

[1]
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl

source/version: bf2d0462cd73 / REL_11_STABLE (compiled 20180828_1629)

1 primary, 4 replicas (on 1 host).
Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres
then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres
then: wait for log-repl sync.
(I also added extra data type columns to the basic pgbench tables to
test all the different data types)

Below is a grep for the assert message, with 15 surrounding lines from
the latest occurrence.

$ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B
15 -Ei 'trap|assert' 65*/logfile.65* | less
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG:
logical decoding found consistent point at 0/89E038E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG:
received replication command: CREATE_REPLICATION_SLOT
"sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG:
logical decoding found initial starting point at 0/8AF5D590
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL:
Waiting for transactions (approximately 1) older than 886297 to end.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG:
logical decoding found consistent point at 0/8B018310
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG:
received replication command: START_REPLICATION SLOT
"sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1',
publication_names '"pub1_6516"')
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG:
starting logical decoding for slot "sub2_6517_6517_18834_sync_18793"
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL:
Streaming transactions committing after 0/89E03918, reading WAL from
0/841EE960.
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG:
logical decoding found consistent point at 0/841EF3E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL:
Logical decoding will begin using saved snapshot.
6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG:
received replication command: CREATE_REPLICATION_SLOT
"sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG:
logical decoding found consistent point at 0/8C6BA408
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516:TRAP:
FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))",
File: "snapbuild.c", Line: 580)
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:
server process (PID 139541) was terminated by signal 6: Aborted
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL:
Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE
READ
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:
terminating any other active server processes
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT:
slot "sub2_6517_6517", output plugin "pgoutput", in the change callback,
associated LSN 0/8B11CF00
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.116 CEST [130398] LOG: all
server processes terminated; reinitializing
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.515 CEST [139548] LOG:
database system was interrupted; last known up at 2018-08-29 14:26:14
CEST

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sergei Kornilov 2018-08-29 15:43:55 Re: Continue work on changes to recovery.conf API
Previous Message Tom Lane 2018-08-29 15:35:51 Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes