Re: BUG #16811: Severe reproducible server backend crash

From: James Inform <james(dot)inform(at)pharmapp(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16811: Severe reproducible server backend crash
Date: 2021-01-18 08:30:57
Message-ID: 78c1f0db-8417-7bc3-83e4-1673872c0793@pharmapp.de
Lists: pgsql-bugs

Is there any further progress on this topic?

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> 7 January 2021, 18:22
>
> I can reproduce it without anything extra. What's needed is to run
> the problematic statement in extended query mode, which you can
> do like this:
>
> $ cat foo.sql
> do $$ begin rollback; end $$;
>
> $ pgbench -n -f foo.sql -M prepared
> pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> perhaps the backend died while processing
>
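> For reference, a minimal, untested libpq sketch of what extended query mode
> means on the client side -- the statement goes through separate parse, bind
> and execute messages instead of one simple query. The connection string and
> file name are placeholders; this is roughly what pgbench -M prepared (and
> the JDBC driver underneath DBeaver) end up doing:
>
> /* repro_extended.c -- hypothetical standalone reproducer (sketch only) */
> #include <stdio.h>
> #include <libpq-fe.h>
>
> int
> main(void)
> {
>     PGconn     *conn = PQconnectdb("dbname=postgres");
>     PGresult   *res;
>
>     if (PQstatus(conn) != CONNECTION_OK)
>     {
>         fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
>         return 1;
>     }
>
>     /* Parse message: prepare the DO block as the unnamed statement */
>     res = PQprepare(conn, "", "do $$ begin rollback; end $$;", 0, NULL);
>     PQclear(res);
>
>     /* Bind + Execute messages: the step that should reach the crash */
>     res = PQexecPrepared(conn, "", 0, NULL, NULL, NULL, 0);
>     fprintf(stderr, "result: %s %s\n",
>             PQresStatus(PQresultStatus(res)), PQerrorMessage(conn));
>     PQclear(res);
>
>     PQfinish(conn);
>     return 0;
> }
>
> Build with something like:
> cc repro_extended.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
>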
> That lnext() should certainly not find portal->stmts to be NIL,
> seeing that we are inside a loop over that list. Ergo, something
> is clobbering this active portal. A bit of gdb'ing says the clobber
> happens here:
>
> #0 AtAbort_Portals () at portalmem.c:833
> (this appears to be inlined code from PortalReleaseCachedPlan)
> #1 0x00000000005a4ce2 in AbortTransaction () at xact.c:2711
> #2 0x00000000005a55d5 in AbortCurrentTransaction () at xact.c:3322
> #3 0x00000000006d1557 in _SPI_rollback (chain=<optimized out>) at
> spi.c:326
> #4 0x00007feef9e851c5 in exec_stmt_rollback (stmt=0x2babca8,
> estate=0x7fff35e55ee0) at pl_exec.c:4961
> #5 exec_stmts (estate=0x7fff35e55ee0, stmts=0x2babd80) at pl_exec.c:2081
> #6 0x00007feef9e863cb in exec_stmt_block (estate=0x7fff35e55ee0,
> block=0x2babdd8) at pl_exec.c:1904
> #7 0x00007feef9e864bb in exec_toplevel_block (
> estate=estate(at)entry=0x7fff35e55ee0, block=0x2babdd8) at pl_exec.c:1602
> #8 0x00007feef9e86ced in plpgsql_exec_function
> (func=func(at)entry=0x2ba7c60,
> fcinfo=fcinfo(at)entry=0x7fff35e56060,
> simple_eval_estate=simple_eval_estate(at)entry=0x2bad6b0,
> simple_eval_resowner=simple_eval_resowner(at)entry=0x2b12e40,
> atomic=<optimized out>) at pl_exec.c:605
> #9 0x00007feef9e8fd58 in plpgsql_inline_handler (fcinfo=<optimized out>)
> at pl_handler.c:344
> #10 0x000000000091a540 in FunctionCall1Coll (flinfo=0x7fff35e561f0,
> collation=<optimized out>, arg1=<optimized out>) at fmgr.c:1141
> #11 0x000000000091aaa9 in OidFunctionCall1Coll (functionId=<optimized
> out>,
> collation=collation(at)entry=0, arg1=45120272) at fmgr.c:1419
> #12 0x000000000064df7e in ExecuteDoStmt (stmt=stmt(at)entry=0x2b07ed8,
> atomic=atomic(at)entry=false) at functioncmds.c:2027
> #13 0x000000000080fa14 in standard_ProcessUtility (pstmt=0x2b07e40,
> queryString=0x2b079a0 "do $$ begin rollback; end $$;",
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
> dest=0xa90540 <donothingDR>, qc=0x7fff35e56630) at utility.c:696
> #14 0x000000000080d044 in PortalRunUtility (portal=0x2b47240,
> pstmt=0x2b07e40,
> isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>,
> dest=0xa90540 <donothingDR>, qc=0x7fff35e56630) at pquery.c:1159
> #15 0x000000000080db24 in PortalRunMulti (portal=portal(at)entry=0x2b47240,
> isTopLevel=isTopLevel(at)entry=true,
> setHoldSnapshot=setHoldSnapshot(at)entry=false, dest=0xa90540 <donothingDR>,
> dest(at)entry=0x2adfa88, altdest=0xa90540 <donothingDR>,
> altdest(at)entry=0x2adfa88, qc=qc(at)entry=0x7fff35e56630) at pquery.c:1311
> #16 0x000000000080e937 in PortalRun (portal=portal(at)entry=0x2b47240,
> count=count(at)entry=9223372036854775807, isTopLevel=isTopLevel(at)entry=true,
> run_once=run_once(at)entry=true, dest=dest(at)entry=0x2adfa88,
> altdest=altdest(at)entry=0x2adfa88, qc=0x7fff35e56630) at pquery.c:779
> #17 0x000000000080c77b in exec_execute_message
> (max_rows=9223372036854775807,
> portal_name=0x2adf670 "") at postgres.c:2196
> #18 PostgresMain (argc=argc(at)entry=1, argv=argv(at)entry=0x7fff35e569c0,
> dbname=<optimized out>, username=<optimized out>) at postgres.c:4452
>
> So I would say that the conditions under which AtAbort_Portals
> decides that it can destroy a portal rather than just mark it failed
> need to be reconsidered. It's not clear to me exactly how that
> should change though. Maybe Peter has more insight.
>
> regards, tom lane
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> 7 January 2021, 11:14
> On Thu, Jan 7, 2021 at 10:25 PM PG Bug reporting form <noreply(at)postgresql(dot)org> wrote:
>
> Thanks for the report. I happened to have DBeaver here and could
> reproduce this, and got the following core:
>
> #0 0x00005651d5c07927 in lnext (l=0x0, c=0x5651d794aca0) at
> ../../../src/include/nodes/pg_list.h:312
> 312 Assert(c >= &l->elements[0] && c < &l->elements[l->length]);
> (gdb) bt
> #0 0x00005651d5c07927 in lnext (l=0x0, c=0x5651d794aca0) at
> ../../../src/include/nodes/pg_list.h:312
> #1 0x00005651d5c096c9 in PortalRunMulti (portal=0x5651d79900a0,
> isTopLevel=true, setHoldSnapshot=false,
> dest=0x5651d60a1580 <donothingDR>, altdest=0x5651d60a1580
> <donothingDR>, qc=0x7ffc44cf7530) at pquery.c:1321
> #2 0x00005651d5c08b31 in PortalRun (portal=0x5651d79900a0, count=200,
> isTopLevel=true, run_once=false,
> dest=0x5651d7928378, altdest=0x5651d7928378, qc=0x7ffc44cf7530) at
> pquery.c:779
> #3 0x00005651d5c03fea in exec_execute_message
> (portal_name=0x5651d7927f60 "", max_rows=200) at postgres.c:2196
> #4 0x00005651d5c06d28 in PostgresMain (argc=1, argv=0x7ffc44cf7760,
> dbname=0x5651d79235f8 "postgres",
> username=0x5651d7953f58 "tmunro") at postgres.c:4405
> #5 0x00005651d5b481d8 in BackendRun (port=0x5651d794d100) at
> postmaster.c:4484
> #6 0x00005651d5b47b07 in BackendStartup (port=0x5651d794d100) at
> postmaster.c:4206
> #7 0x00005651d5b43f28 in ServerLoop () at postmaster.c:1730
> #8 0x00005651d5b43777 in PostmasterMain (argc=3, argv=0x5651d79215b0)
> at postmaster.c:1402
> #9 0x00005651d5a43279 in main (argc=3, argv=0x5651d79215b0) at main.c:209
> (gdb) f 1
> #1 0x00005651d5c096c9 in PortalRunMulti (portal=0x5651d79900a0,
> isTopLevel=true, setHoldSnapshot=false,
> dest=0x5651d60a1580 <donothingDR>, altdest=0x5651d60a1580
> <donothingDR>, qc=0x7ffc44cf7530) at pquery.c:1321
> 1321 if (lnext(portal->stmts, stmtlist_item) != NULL)
> (gdb) print portal->stmts
> $1 = (List *) 0x0
>
> I didn't have time to investigate whether this is the right fix, but
> this cargo cult change fixes the problem:
>
> --- a/src/backend/tcop/pquery.c
> +++ b/src/backend/tcop/pquery.c
> @@ -1318,7 +1318,7 @@ PortalRunMulti(Portal portal,
>               * Increment command counter between queries, but not after the last
>               * one.
>               */
> -            if (lnext(portal->stmts, stmtlist_item) != NULL)
> +            if (portal->stmts && lnext(portal->stmts, stmtlist_item) != NULL)
>                   CommandCounterIncrement();
>
> Maybe something to do with commit
> 1cff1b95ab6ddae32faa3efe0d95a820dbfdc164. I can dig some more
> tomorrow if someone doesn't beat me to it.
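>
> For anyone following along: the PG 13 lnext() is an inline function that
> looks roughly like this (paraphrased from pg_list.h, not a verbatim copy):
>
> static inline ListCell *
> lnext(const List *l, const ListCell *c)
> {
>     /* c must point into l's in-line element array */
>     Assert(c >= &l->elements[0] && c < &l->elements[l->length]);
>     c++;
>     if (c < &l->elements[l->length])
>         return (ListCell *) c;
>     return NULL;
> }
>
> With portal->stmts set to NIL (a NULL List pointer), both the Assert and the
> following bounds check dereference the NULL header, hence the segfault. The
> pre-13 lnext(cell) was just (cell)->next and never touched the List header,
> which may be part of why 12.x does not crash here.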
> PG Bug reporting form <noreply(at)postgresql(dot)org>
> 7 January 2021, 00:25
> The following bug has been logged on the website:
>
> Bug reference: 16811
> Logged by: James Inform
> Email address: james(dot)inform(at)pharmapp(dot)de
> PostgreSQL version: 13.1
> Operating system: Ubuntu 18.04 LTS AND macOS Catalina 10.15.7
> Description:
>
> First things first: Happy new year to all of you and stay healthy during
> these days.
>
> I have run into a severe backend crash that makes the whole PostgreSQL
> server shut down and restart in recovery.
>
> This is reproducible on Mac using the latest version of the Postgresapp
> (www.postgresapp.com), which comes with PG 13.1, as well as on Ubuntu with a
> self-built PG 13.1 and also with the latest version from the REL_13_STABLE
> branch.
>
> The issue doesn't exist on PG 12.5, neither on Mac nor on Ubuntu.
>
> It's very severe because, with the steps for reproduction below, it seems
> you can tear down every PostgreSQL 13 server you have access to.
>
> It's enough to run the following simple anonymous block with an appropriate
> db tool:
>
> do
> $$
> begin
> rollback;
> end
> $$
> ;
>
> This works fine in psql or from python using psycopg2.
>
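> For example, interactively in psql the block just completes (illustrative
> transcript, default local connection assumed):
>
> postgres=# do $$ begin rollback; end $$;
> DO
>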
> But when using the DBeaver tool (www.dbeaver.io / Version 7.3.2) to run the
> command, the whole PostgreSQL server is torn down. It is not only DBeaver;
> it is also reproducible with Lazarus/Delphi [yes, I am still using good old
> Pascal :)] on Linux and Windows. With Lazarus I am using components that
> have a pg client library compiled in, while DBeaver uses Java libraries. I
> can reproduce the behaviour with these two tools, and I would bet that there
> are many more tools out there that deliver the same result.
>
> Steps to reproduce:
> 1. Install or build PostgreSQL 13.1 on Ubuntu or use the Postgresapp on
> Mac.
> 2. Init a new PG13 cluster (initdb)
> 3. Download and install DBeaver on Linux, Mac or Windows
> 4. Connect to PostgreSQL using the postgres superuser
> 5. Execute the anonymous block from above
> 6. Voila ... you have forced PostgreSQL to shutdown and recover
>
> I have started postgres with the core dump option: pg_ctl start -D /opt/data13 -c
> So I have a core dump at hand. But using the bug form on www.postgresql.org
> I am not able to add an attachment. How can I provide you with such an
> attachment?
>
> Cheers, James
>
> --################################################################################
> -- Now the relevant log entries
> -- You can see that the server starts
> -- And just after executing the command from (in this case) DBeaver the
> corresponding error is logged
> --################################################################################
>
> pglinux(at)aggaserv:/opt/data13$ pg_ctl start -D /opt/data13/ -c
> waiting for server to start....2021-01-06 22:36:18.375 UTC [38787] LOG:
> starting PostgreSQL 13.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
> 2021-01-06 22:36:18.375 UTC [38787] LOG: listening on IPv6 address "::1",
> port 5432
> 2021-01-06 22:36:18.375 UTC [38787] LOG: listening on IPv4 address
> "127.0.0.1", port 5432
> 2021-01-06 22:36:18.386 UTC [38787] LOG: listening on Unix socket
> "/tmp/.s.PGSQL.5432"
> 2021-01-06 22:36:18.405 UTC [38788] LOG: database system was shut down at
> 2021-01-06 22:26:20 UTC
> 2021-01-06 22:36:18.417 UTC [38787] LOG: database system is ready to
> accept
> connections
> done
> server started
> pglinux(at)aggaserv:/opt/data13$ 2021-01-06 22:36:30.022 UTC [38787] LOG:
> server process (PID 38881) was terminated by signal 11: Segmentation fault
> 2021-01-06 22:36:30.022 UTC [38787] DETAIL: Failed process was running:
> do
> $$
> begin
> rollback;
> end
> $$
>
> 2021-01-06 22:36:30.022 UTC [38787] LOG: terminating any other active
> server processes
> 2021-01-06 22:36:30.022 UTC [38792] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:30.022 UTC [38792] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:30.022 UTC [38792] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:30.023 UTC [38880] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:30.023 UTC [38880] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:30.023 UTC [38880] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:30.024 UTC [38787] LOG: all server processes terminated;
> reinitializing
> 2021-01-06 22:36:30.043 UTC [38884] LOG: database system was interrupted;
> last known up at 2021-01-06 22:36:18 UTC
> 2021-01-06 22:36:30.088 UTC [38884] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2021-01-06 22:36:30.095 UTC [38884] LOG: invalid record length at
> 0/1693F78: wanted 24, got 0
> 2021-01-06 22:36:30.095 UTC [38884] LOG: redo is not required
> 2021-01-06 22:36:30.144 UTC [38787] LOG: database system is ready to
> accept
> connections
> 2021-01-06 22:36:30.686 UTC [38787] LOG: server process (PID 38977) was
> terminated by signal 11: Segmentation fault
> 2021-01-06 22:36:30.686 UTC [38787] DETAIL: Failed process was running:
> do
> $$
> begin
> rollback;
> end
> $$
>
> 2021-01-06 22:36:30.686 UTC [38787] LOG: terminating any other active
> server processes
> 2021-01-06 22:36:30.686 UTC [38906] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:30.686 UTC [38906] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:30.686 UTC [38906] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:30.686 UTC [38976] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:30.686 UTC [38976] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:30.686 UTC [38976] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:30.688 UTC [38787] LOG: all server processes terminated;
> reinitializing
> 2021-01-06 22:36:30.695 UTC [38979] LOG: database system was interrupted;
> last known up at 2021-01-06 22:36:30 UTC
> 2021-01-06 22:36:30.702 UTC [38979] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2021-01-06 22:36:30.709 UTC [38979] LOG: invalid record length at
> 0/1693FF0: wanted 24, got 0
> 2021-01-06 22:36:30.709 UTC [38979] LOG: redo is not required
> 2021-01-06 22:36:30.748 UTC [38787] LOG: database system is ready to
> accept
> connections
> 2021-01-06 22:36:31.894 UTC [38787] LOG: server process (PID 39159) was
> terminated by signal 11: Segmentation fault
> 2021-01-06 22:36:31.894 UTC [38787] DETAIL: Failed process was running:
> do
> $$
> begin
> rollback;
> end
> $$
>
> 2021-01-06 22:36:31.894 UTC [38787] LOG: terminating any other active
> server processes
> 2021-01-06 22:36:31.894 UTC [39158] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:31.894 UTC [39158] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:31.894 UTC [39158] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:31.894 UTC [38986] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:31.894 UTC [38986] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:31.894 UTC [38986] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:31.895 UTC [38787] LOG: all server processes terminated;
> reinitializing
> 2021-01-06 22:36:31.905 UTC [39161] LOG: database system was interrupted;
> last known up at 2021-01-06 22:36:30 UTC
> 2021-01-06 22:36:31.912 UTC [39161] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2021-01-06 22:36:31.919 UTC [39161] LOG: invalid record length at
> 0/1694080: wanted 24, got 0
> 2021-01-06 22:36:31.919 UTC [39161] LOG: redo is not required
> 2021-01-06 22:36:31.958 UTC [38787] LOG: database system is ready to
> accept
> connections
> 2021-01-06 22:36:32.561 UTC [38787] LOG: server process (PID 39253) was
> terminated by signal 11: Segmentation fault
> 2021-01-06 22:36:32.561 UTC [38787] DETAIL: Failed process was running:
> do
> $$
> begin
> rollback;
> end
> $$
>
> 2021-01-06 22:36:32.561 UTC [38787] LOG: terminating any other active
> server processes
> 2021-01-06 22:36:32.562 UTC [39252] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:32.562 UTC [39252] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:32.562 UTC [39252] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:32.562 UTC [39167] WARNING: terminating connection
> because
> of crash of another server process
> 2021-01-06 22:36:32.562 UTC [39167] DETAIL: The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2021-01-06 22:36:32.562 UTC [39167] HINT: In a moment you should be
> able to
> reconnect to the database and repeat your command.
> 2021-01-06 22:36:32.562 UTC [38787] LOG: all server processes terminated;
> reinitializing
> 2021-01-06 22:36:32.569 UTC [39255] LOG: database system was interrupted;
> last known up at 2021-01-06 22:36:31 UTC
> 2021-01-06 22:36:32.576 UTC [39255] LOG: database system was not properly
> shut down; automatic recovery in progress
> 2021-01-06 22:36:32.583 UTC [39255] LOG: invalid record length at
> 0/16940F8: wanted 24, got 0
> 2021-01-06 22:36:32.583 UTC [39255] LOG: redo is not required
> 2021-01-06 22:36:32.621 UTC [38787] LOG: database system is ready to
> accept
> connections
>
