Question regarding ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME in dshash_detach()

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>
Subject: Question regarding ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME in dshash_detach()
Date: 2022-08-23 06:58:48
Message-ID: CABOikdMzogyfrPLQCNyZkRwX5fR_2-aQVFDeqAg2N3=FhXDfNA@mail.gmail.com
Lists: pgsql-hackers

Hi Andres,

One of my tests hit an assertion in dshash_detach(). Once again this is
with BDR and I don't have a reproduction case with standalone PG. Also,
this probably happened because of some weirdness in systemd where it
removes shared memory segments underneath, resulting in ERRORs being thrown.

However, looking at the stack trace and the code, I wonder if it's possible
to hit the assertion even with stock postgres. In my case, the stack trace
looked like:

```
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fa5775b9535 in __GI_abort () at abort.c:79
#2 0x0000556dbce828bc in ExceptionalCondition
(conditionName=0x556dbd027c88
"!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock,
DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))",
errorType=0x556dbd027c44 "FailedAssertion", fileName=0x556dbd027c10
"/opt/postgres/src/postgres/src/backend/lib/dshash.c", lineNumber=309)
at /opt/postgres/src/postgres/src/backend/utils/error/assert.c:69
#3 0x0000556dbcae0aae in dshash_detach (hash_table=0x556dbe0294f0) at
/opt/postgres/src/postgres/src/backend/lib/dshash.c:309
#4 0x0000556dbcd045bf in pgstat_detach_shmem () at
/opt/postgres/src/postgres/src/backend/utils/activity/pgstat_shmem.c:240
#5 0x0000556dbccfd263 in pgstat_shutdown_hook (code=0, arg=0) at
/opt/postgres/src/postgres/src/backend/utils/activity/pgstat.c:509
#6 0x0000556dbcca18b1 in shmem_exit (code=0) at
/opt/postgres/src/postgres/src/backend/storage/ipc/ipc.c:239
#7 0x0000556dbcca1769 in proc_exit_prepare (code=0) at
/opt/postgres/src/postgres/src/backend/storage/ipc/ipc.c:194
#8 0x0000556dbcca16ba in proc_exit (code=0) at
/opt/postgres/src/postgres/src/backend/storage/ipc/ipc.c:107
#9 0x0000556dbcbfcadc in AutoVacWorkerMain (argc=0, argv=0x0) at
/opt/postgres/src/postgres/src/backend/postmaster/autovacuum.c:1590
#10 0x0000556dbcbfc968 in StartAutoVacWorker () at
/opt/postgres/src/postgres/src/backend/postmaster/autovacuum.c:1496
#11 0x0000556dbcc0aa50 in StartAutovacuumWorker () at
/opt/postgres/src/postgres/src/backend/postmaster/postmaster.c:5534
#12 0x0000556dbcc0a56b in sigusr1_handler (postgres_signal_arg=10) at
/opt/postgres/src/postgres/src/backend/postmaster/postmaster.c:5239
#13 <signal handler called>
#14 0x00007fa577687a27 in __GI___select (nfds=10, readfds=0x7fff6e69a370,
writefds=0x0, exceptfds=0x0, timeout=0x7fff6e69a3f0) at
../sysdeps/unix/sysv/linux/select.c:41
#15 0x0000556dbcc05e7f in ServerLoop () at
/opt/postgres/src/postgres/src/backend/postmaster/postmaster.c:1770
#16 0x0000556dbcc0581e in PostmasterMain (argc=5, argv=0x556dbe027490) at
/opt/postgres/src/postgres/src/backend/postmaster/postmaster.c:1478
#17 0x0000556dbcafcaf1 in main (argc=5, argv=0x556dbe027490) at
/opt/postgres/src/postgres/src/backend/main/main.c:202
```

If the autovacuum worker is not inside a transaction and throws an ERROR
while holding a lock on the dshash, AFAICS it can reach proc_exit() without
releasing the lock, because no transaction abort processing runs.

For example, at autovacuum.c:1694, pgstat_report_autovac() can
theoretically end up calling `dsa_get_address()`, which calls
`get_segment_by_index()`, and that function has a couple of elog(ERROR) calls.

I understand that this ERROR path is unlikely to be hit in normal
operation, but if it is, as in my case, it results in an assertion
failure. I also think a similar problem could have occurred in older
releases (not the assertion failure, but backends exiting with an
LWLock still held), though maybe the likelihood was very small before.

If this is a problem worth addressing, I wonder if we should explicitly
release all LWLocks in the long jump handler, like we do for other
processes.
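
To illustrate the failure mode, here is a toy, standalone C model. It is
not PostgreSQL code: every `toy_*` name is a hypothetical stand-in, with
`toy_raise_error()` playing the role of elog(ERROR) via longjmp and a plain
counter playing the role of held LWLocks. It is only a sketch of why a
handler that goes straight to exit leaves a lock held, and of how releasing
everything in the handler would avoid that.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Toy stand-ins only; none of these names are PostgreSQL APIs. */
static jmp_buf toy_error_context;
static int toy_locks_held = 0;

static void toy_lock_acquire(void) { toy_locks_held++; }

static void toy_lock_release(void)
{
    assert(toy_locks_held > 0);
    toy_locks_held--;
}

static void toy_lock_release_all(void) { toy_locks_held = 0; }

/* Models elog(ERROR): jumps to the handler; nothing releases locks on the way. */
static void toy_raise_error(void) { longjmp(toy_error_context, 1); }

/* Models a stats call that takes a partition lock and may fail while holding it. */
static void toy_report_stats(bool fail)
{
    toy_lock_acquire();
    if (fail)
        toy_raise_error();
    toy_lock_release();
}

/*
 * Returns the number of locks still held at the point where the exit
 * hooks (and hence the "no locks held" assertion) would run.
 */
int toy_worker_exit_lock_count(bool fail, bool release_in_handler)
{
    toy_locks_held = 0;

    if (setjmp(toy_error_context) != 0)
    {
        /* Error handler: there is no transaction abort to clean up for us. */
        if (release_in_handler)
            toy_lock_release_all();
        return toy_locks_held;
    }

    toy_report_stats(fail);
    return toy_locks_held;
}
```

With `fail` set and `release_in_handler` unset, the function returns 1,
which corresponds to the state in which the assertion in dshash_detach()
fires. With both set, it returns 0.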

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB: https://www.enterprisedb.com
