BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: exclusion(at)gmail(dot)com
Subject: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure
Date: 2023-05-26 11:00:01
Message-ID: 17947-b9554521ad963c9c@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 17947
Logged by: Alexander Lakhin
Email address: exclusion(at)gmail(dot)com
PostgreSQL version: 15.3
Operating system: Ubuntu 22.04
Description:

The following script:
numclients=10
for ((c=1;c<=numclients;c++)); do
(
echo "
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding', false);
SELECT pg_sleep(0.05);
SELECT pg_drop_replication_slot('regression_slot');

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding', false);
SELECT pg_sleep(0.05);
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2',
'test_decoding', true);
SELECT pg_drop_replication_slot('regression_slot');

CREATE TABLE tbl_$c(id int);
" | psql >psql-$c.log
) &
done
wait

(with
wal_level = logical
in postgresql.conf)

leads to an assertion failure (up to 10 runs required in my environment):
2023-05-26 13:08:09.243 MSK [2787598] ERROR: can only drop stats once
TRAP:
FailedAssertion("!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock,
DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))", File: "dshash.c", Line:
400, PID: 2787598)

Core was generated by `postgres: law regression [local] CREATE TABLE
'.
Program terminated with signal SIGABRT, Aborted.

#0 __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140618617759552) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140618617759552) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140618617759552) at
./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140618617759552, signo=signo(at)entry=6) at
./nptl/pthread_kill.c:89
#3 0x00007fe45210e476 in __GI_raise (sig=sig(at)entry=6) at
../sysdeps/posix/raise.c:26
#4 0x00007fe4520f47f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x0000555f13dad3e8 in ExceptionalCondition (
conditionName=conditionName(at)entry=0x555f13f06090
"!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock,
DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))",
errorType=errorType(at)entry=0x555f13e0b00b "FailedAssertion",
fileName=fileName(at)entry=0x555f13f05f6c "dshash.c",
lineNumber=lineNumber(at)entry=400) at assert.c:69
#6 0x0000555f13b25896 in dshash_find (hash_table=0x555f15b6b668,
key=key(at)entry=0x7fff8481e240,
exclusive=exclusive(at)entry=false) at dshash.c:400
#7 0x0000555f13c93ea6 in pgstat_get_entry_ref
(kind=kind(at)entry=PGSTAT_KIND_RELATION, dboid=dboid(at)entry=16384,
objoid=objoid(at)entry=3467, create=create(at)entry=true,
created_entry=created_entry(at)entry=0x0) at pgstat_shmem.c:439
#8 0x0000555f13c8edd9 in pgstat_prep_pending_entry
(kind=kind(at)entry=PGSTAT_KIND_RELATION, dboid=16384,
objoid=objoid(at)entry=3467, created_entry=created_entry(at)entry=0x0) at
pgstat.c:1094
#9 0x0000555f13c91599 in pgstat_prep_relation_pending (isshared=false,
rel_id=3467) at pgstat_relation.c:855
#10 pgstat_assoc_relation (rel=rel(at)entry=0x7fe446392740) at
pgstat_relation.c:138
#11 0x0000555f1398ff8e in _bt_first (scan=scan(at)entry=0x555f15c3b5e0,
dir=dir(at)entry=ForwardScanDirection)
at nbtsearch.c:882
#12 0x0000555f13989e46 in btgettuple (scan=0x555f15c3b5e0,
dir=ForwardScanDirection) at nbtree.c:242
#13 0x0000555f1397c014 in index_getnext_tid (scan=0x555f15c3b5e0,
direction=<optimized out>) at indexam.c:533
#14 0x0000555f1397c28b in index_getnext_slot (scan=0x555f15c3b5e0,
direction=direction(at)entry=ForwardScanDirection,
slot=0x555f15c3b4c8) at indexam.c:625
#15 0x0000555f1397b426 in systable_getnext_ordered
(sysscan=sysscan(at)entry=0x555f15c3b470,
direction=direction(at)entry=ForwardScanDirection) at genam.c:715
#16 0x0000555f13d95e25 in BuildEventTriggerCache () at evtcache.c:151
#17 EventCacheLookup (event=<optimized out>, event(at)entry=EVT_SQLDrop) at
evtcache.c:69
#18 0x0000555f13a72e32 in trackDroppedObjectsNeeded () at
event_trigger.c:1147
#19 0x0000555f13a72ea0 in EventTriggerBeginCompleteQuery () at
event_trigger.c:1089
#20 0x0000555f13c7c510 in ProcessUtilitySlow (pstate=0x555f15b91540,
pstmt=0x555f15b70440,
queryString=0x555f15b6f410 "CREATE TABLE tbl_4(id int);",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, qc=0x7fff8481fe10, dest=<optimized out>) at
utility.c:1114
#21 0x0000555f13c7bb53 in standard_ProcessUtility (pstmt=0x555f15b70440,
queryString=0x555f15b6f410 "CREATE TABLE tbl_4(id int);",
readOnlyTree=<optimized out>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x555f15b70530, qc=0x7fff8481fe10)
at utility.c:1074
#22 0x0000555f13c7a201 in PortalRunUtility
(portal=portal(at)entry=0x555f15bdbbc0, pstmt=pstmt(at)entry=0x555f15b70440,
isTopLevel=isTopLevel(at)entry=true,
setHoldSnapshot=setHoldSnapshot(at)entry=false, dest=dest(at)entry=0x555f15b70530,

qc=qc(at)entry=0x7fff8481fe10) at pquery.c:1158
#23 0x0000555f13c7a33d in PortalRunMulti
(portal=portal(at)entry=0x555f15bdbbc0, isTopLevel=isTopLevel(at)entry=true,
setHoldSnapshot=setHoldSnapshot(at)entry=false,
dest=dest(at)entry=0x555f15b70530,
altdest=altdest(at)entry=0x555f15b70530, qc=qc(at)entry=0x7fff8481fe10) at
pquery.c:1315
#24 0x0000555f13c7a9e1 in PortalRun (portal=portal(at)entry=0x555f15bdbbc0,
count=count(at)entry=9223372036854775807,
isTopLevel=isTopLevel(at)entry=true, run_once=run_once(at)entry=true,
dest=dest(at)entry=0x555f15b70530,
altdest=altdest(at)entry=0x555f15b70530, qc=0x7fff8481fe10) at
pquery.c:791
#25 0x0000555f13c766bd in exec_simple_query (query_string=0x555f15b6f410
"CREATE TABLE tbl_4(id int);")
at postgres.c:1250
#26 0x0000555f13c78282 in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4593
#27 0x0000555f13be2b42 in BackendRun (port=0x555f15b930a0,
port=0x555f15b930a0) at postmaster.c:4511
#28 BackendStartup (port=0x555f15b930a0) at postmaster.c:4239
#29 ServerLoop () at postmaster.c:1806
#30 0x0000555f13be3beb in PostmasterMain (argc=argc(at)entry=3,
argv=argv(at)entry=0x555f15b69670) at postmaster.c:1478
#31 0x0000555f13908cff in main (argc=3, argv=0x555f15b69670) at main.c:202

(Initially, I'd got this exception when playing with a concurrent
installcheck for contrib/test_decoding [1].)

With some debug logging added, I see that this failure is a consequence of
the following combination of circumstances/issues:
First, when a replication slot created, it can just reuse an pre-existing
pgstat hash_entry, with increasing refcount.
ReplicationSlotCreate() -> pgstat_create_replslot() ->
pgstat_get_entry_ref_locked() -> pgstat_get_entry_ref() - >
pgstat_reinit_entry():
/* mark as not dropped anymore */
pg_atomic_fetch_add_u32(&shhashent->refcount, 1);
shhashent->dropped = false;
Later, when that replication slot getting dropped, that hash_entry stays
alive (with refcount > 0), but has dropped = true;
(So, pg_create_logical_replication_slot/pg_drop_logical_replication_slot
can leave the pgstat hash_entry having dropped = true, but refcount > 1.)

Second, when another replication slot created, it can reuse the defective
pgstat_get_entry_ref from the local cache:
ReplicationSlotCreate() -> pgstat_create_replslot() ->
pgstat_get_entry_ref_locked() -> pgstat_get_entry_ref():
/*
* First check the lookup cache hashtable in local memory. If we find
a
* match here we can avoid taking locks / causing contention.
*/
if (pgstat_get_entry_ref_cached(key, &entry_ref))
return entry_ref;
(Thus, a backend can get a new replication slot with an existing pgstat
hash_entry, that has dropped = true.)

Third, when the replication slot with pgstat hash_entry->dropped = true is
dropped again, the error "can only drop stats once" is raised in
pgstat_drop_entry_internal():
/*
* Signal that the entry is dropped - this will eventually cause other
* backends to release their references.
*/
if (shent->dropped)
elog(ERROR, "can only drop stats once; hstat: %p", hstat);
but in this case dshash_release_lock() is not called, so that leads to the
assertion failure, when trying to access pgstat's hash on creating a
relation later.

[1]
https://www.postgresql.org/message-id/7e4d4a80-3e3c-231f-f886-6cada2aa582b%40gmail.com

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2023-05-26 13:51:09 Re: Comparing date strings with jsonpath expression
Previous Message Kyotaro Horiguchi 2023-05-26 09:02:07 Re: BUG #17942: vacuumdb doesn't populate extended statistics on partitioned tables