BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: exclusion(at)gmail(dot)com
Subject: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Date: 2023-07-19 15:00:01
Message-ID: 18031-ebe2d08cb405f6cc@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 18031
Logged by: Alexander Lakhin
Email address: exclusion(at)gmail(dot)com
PostgreSQL version: 16beta2
Operating system: Ubuntu 22.04
Description:

The following script:
numclients=100
for ((c=1;c<=$numclients;c++)); do
  createdb db$c
done

for ((i=1;i<=50;i++)); do
  echo "ITERATION $i"
  for ((c=1;c<=$numclients;c++)); do
    echo "SELECT format('CREATE TABLE t1_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql1_$c.out 2>&1 &
    echo "SELECT format('CREATE TABLE t2_%s_$i (t TEXT);', g) FROM generate_series(1,10) g \\gexec" | psql db$c >psql2_$c.out 2>&1 &
    echo "SELECT 'VACUUM FULL pg_class' FROM generate_series(1,10) g \\gexec" | psql db$c >psql3_$c.out 2>&1 &
  done
  wait
  grep -C1 'signal 11' server.log && break;
done

when executed with the following custom settings:
fsync = off
max_connections = 1000
deadlock_timeout = 100ms
min_parallel_table_scan_size = 1kB

leads to a server crash:
...
ITERATION 15
2023-07-19 17:01:46.542 MSK [1449673] STATEMENT: VACUUM FULL pg_class
2023-07-19 17:01:46.617 MSK [1434014] LOG: background worker "parallel
worker" (PID 1450094) was terminated by signal 11: Segmentation fault
2023-07-19 17:01:46.617 MSK [1434014] DETAIL: Failed process was running:
VACUUM FULL pg_class

With the following stack trace:
Core was generated by `postgres: parallel worker for PID 1449148 waiting
'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/1450094' in core file too small.
#0 RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at
lock.c:1874
1874 LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0 RemoveFromWaitQueue (proc=0x7fbae3910400, hashcode=1106083722) at
lock.c:1874
#1 0x000055dcfc60f595 in LockErrorCleanup () at proc.c:735
#2 0x000055dcfc618757 in ProcessInterrupts () at postgres.c:3207
#3 0x000055dcfc60fbe1 in ProcSleep
(locallock=locallock(at)entry=0x55dcfe964380,
lockMethodTable=lockMethodTable(at)entry=0x55dcfca92800
<default_lockmethod>) at proc.c:1295
#4 0x000055dcfc5ffe0b in WaitOnLock
(locallock=locallock(at)entry=0x55dcfe964380,
owner=owner(at)entry=0x55dcfe978ed8) at lock.c:1818
#5 0x000055dcfc601ba5 in LockAcquireExtended
(locktag=locktag(at)entry=0x7ffd32e28110,
lockmode=lockmode(at)entry=5, sessionLock=sessionLock(at)entry=false,
dontWait=dontWait(at)entry=false,
reportMemoryError=reportMemoryError(at)entry=true,
locallockp=locallockp(at)entry=0x0) at lock.c:1082
#6 0x000055dcfc601c67 in LockAcquire (locktag=locktag(at)entry=0x7ffd32e28110,
lockmode=lockmode(at)entry=5,
sessionLock=sessionLock(at)entry=false, dontWait=dontWait(at)entry=false) at
lock.c:740
#7 0x000055dcfc5fe85f in XactLockTableWait (xid=xid(at)entry=43588,
rel=rel(at)entry=0x7fbad92f4348,
ctid=ctid(at)entry=0x55dcfe9f74b4, oper=oper(at)entry=XLTW_InsertIndexUnique)
at lmgr.c:702
#8 0x000055dcfc2720d2 in heapam_index_build_range_scan
(heapRelation=0x7fbad92f4348,
indexRelation=0x7fbad92f90c8, indexInfo=0x55dcfe9f7138,
allow_sync=<optimized out>, anyvisible=false,
progress=false, start_blockno=0, numblocks=4294967295,
callback=0x55dcfc29aba6 <_bt_build_callback>,
callback_state=0x7ffd32e28610, scan=0x55dcfe9f7458) at
heapam_handler.c:1496
#9 0x000055dcfc29af07 in table_index_build_scan (scan=<optimized out>,
callback_state=0x7ffd32e28610,
callback=0x55dcfc29aba6 <_bt_build_callback>, progress=false,
allow_sync=true,
index_info=0x55dcfe9f7138, index_rel=<optimized out>,
table_rel=<optimized out>)
at ../../../../src/include/access/tableam.h:1781
#10 _bt_parallel_scan_and_sort (btspool=btspool(at)entry=0x55dcfe9f7d08,
btspool2=btspool2(at)entry=0x0,
btshared=btshared(at)entry=0x7fbad93122a0,
sharedsort=sharedsort(at)entry=0x7fbad9312240,
sharedsort2=sharedsort2(at)entry=0x0, sortmem=<optimized out>,
progress=false) at nbtsort.c:1985
#11 0x000055dcfc29d343 in _bt_parallel_build_main (seg=0x55dcfe979608,
toc=0x7fbad9312000) at nbtsort.c:1888
#12 0x000055dcfc2c64a6 in ParallelWorkerMain (main_arg=<optimized out>) at
parallel.c:1520
#13 0x000055dcfc56faf0 in StartBackgroundWorker () at bgworker.c:861
#14 0x000055dcfc57631a in do_start_bgworker (rw=rw(at)entry=0x55dcfe969ad0) at
postmaster.c:5762
#15 0x000055dcfc576496 in maybe_start_bgworkers () at postmaster.c:5986
#16 0x000055dcfc576dbf in process_pm_pmsignal () at postmaster.c:5149
#17 0x000055dcfc57736b in ServerLoop () at postmaster.c:1770
#18 0x000055dcfc5788ef in PostmasterMain (argc=argc(at)entry=3,
argv=argv(at)entry=0x55dcfe938620)
at postmaster.c:1463
#19 0x000055dcfc4964f5 in main (argc=3, argv=0x55dcfe938620) at main.c:198

(I also got another stack trace:
(gdb) bt
#0 RemoveFromWaitQueue (proc=0x7f763dde44c0, hashcode=1514899366) at
lock.c:1874
#1 0x0000559984e4e595 in LockErrorCleanup () at proc.c:735
#2 0x0000559984b125cc in AbortTransaction () at xact.c:2735
#3 0x0000559984b1307d in AbortCurrentTransaction () at xact.c:3338
#4 0x0000559984e5b68a in PostgresMain (dbname=<optimized out>,
username=0x55998640c6d8 "law")
at postgres.c:4354
#5 0x0000559984db3183 in BackendRun (port=port(at)entry=0x559986445a60) at
postmaster.c:4461
#6 0x0000559984db623a in BackendStartup (port=port(at)entry=0x559986445a60) at
postmaster.c:4189
#7 0x0000559984db63d8 in ServerLoop () at postmaster.c:1779
#8 0x0000559984db78ef in PostmasterMain (argc=argc(at)entry=3,
argv=argv(at)entry=0x55998640a620)
at postmaster.c:1463
#9 0x0000559984cd54f5 in main (argc=3, argv=0x55998640a620) at main.c:198
)

server.log also contains:
2023-07-19 17:01:46.542 MSK [1449148] ERROR: deadlock detected
2023-07-19 17:01:46.542 MSK [1449148] DETAIL: Process 1449148 waits for
ShareLock on transaction 43588; blocked by process 1449150.
Process 1449150 waits for RowExclusiveLock on relation 1259 of database
16385; blocked by process 1449148.
Process 1449148: VACUUM FULL pg_class
Process 1449150: CREATE TABLE t2_2_15 (t TEXT);
2023-07-19 17:01:46.542 MSK [1449148] HINT: See server log for query
details.
2023-07-19 17:01:46.542 MSK [1449148] CONTEXT: while checking uniqueness of
tuple (28,39) in relation "pg_class"
2023-07-19 17:01:46.542 MSK [1449148] STATEMENT: VACUUM FULL pg_class

It looks like in LockErrorCleanup():

    if (!dlist_node_is_detached(&MyProc->links))
    {
        /* We could not have been granted the lock yet */
        RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
    }

MyProc->links turned out not to be detached, while MyProc->waitLock was already
NULL, so RemoveFromWaitQueue() dereferenced a NULL waitLock at lock.c:1874.

`git bisect` for this anomaly points at 5764f611e.
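
(For what it's worth, 5764f611e switched the lock wait queues from SHM_QUEUE to
dlists. My guess, not a verified diagnosis, is that the worker's PGPROC had
already been taken off the wait queue elsewhere with something equivalent to a
plain dlist_delete(), which, if I read ilist.h correctly, leaves the removed
node's own prev/next pointers untouched, so dlist_node_is_detached() still
reports the node as attached even though waitLock has been reset. Below is a
stand-alone sketch of just that pointer behaviour; the list_* helpers are
simplified stand-ins I wrote for illustration, not the real ilist.h code:

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-ins for dlist_node / dlist_head from ilist.h */
typedef struct node
{
    struct node *prev;
    struct node *next;
} node;

typedef struct list_head
{
    node        head;           /* circular list, head is a sentinel */
} list_head;

static void
list_init(list_head *h)
{
    h->head.prev = h->head.next = &h->head;
}

static void
list_push_tail(list_head *h, node *n)
{
    n->prev = h->head.prev;
    n->next = &h->head;
    h->head.prev->next = n;
    h->head.prev = n;
}

/* plain delete: unlink the neighbours, keep the node's own pointers */
static void
list_delete(node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* "thorough" delete: additionally reset the removed node's pointers */
static void
list_delete_thoroughly(node *n)
{
    list_delete(n);
    n->prev = n->next = NULL;
}

/* "detached" means "not on any list", signalled by NULL pointers */
static int
node_is_detached(const node *n)
{
    return n->next == NULL;
}

int
main(void)
{
    list_head   queue;
    node        a = {NULL, NULL};
    node        b = {NULL, NULL};

    list_init(&queue);
    list_push_tail(&queue, &a);
    list_push_tail(&queue, &b);

    list_delete(&a);            /* what another backend would effectively do */
    list_delete_thoroughly(&b);

    printf("a detached: %d\n", node_is_detached(&a));  /* prints 0: a still looks queued */
    printf("b detached: %d\n", node_is_detached(&b));  /* prints 1 */
    return 0;
}

With the real API the difference would be dlist_delete() versus
dlist_delete_thoroughly(); I haven't tracked down where exactly the plain
delete happens in this scenario.)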
