BUG #18586: Process (and transaction) is stuck in IPC when the DB is under high load

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: d_van_rietschoten(at)hotmail(dot)com
Subject: BUG #18586: Process (and transaction) is stuck in IPC when the DB is under high load
Date: 2024-08-21 11:53:29
Message-ID: 18586-03e1535b1b34db81@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 18586
Logged by: Dimar van Rietschoten
Email address: d_van_rietschoten(at)hotmail(dot)com
PostgreSQL version: 15.7
Operating system: Unknown (GCP managed)
Description:

We have a small database server (2 cores, 4 GB). We performed an initial data
load, which put quite a bit of load on this database. The initial load is
performed by external Kubernetes pods.
After a while, we saw the number of connections to the database increase until
we ran out of connections.

We found that one of the processes had been stuck waiting on IPC for many
hours. Its transaction held a lock, causing other transactions to wait on
that lock.
It seems that after a while, the application doing the initial load abandoned
the transaction/connection and created a new one. This new connection then hit
the same lock, with the same outcome.

We could reproduce this with a somewhat larger instance (4 CPU, 8 GB). With 6
CPU and 12 GB we did not run into the problem.

Query used to find the stuck transactions:
select * from pg_stat_activity order by backend_start asc;

Result (condensed to the non-null columns; the remaining columns were null or
redacted):

pid 21   checkpointer                   backend_start 2024-08-16 13:17:02.331128   wait Timeout/CheckpointWriteDelay
pid 22   background writer              backend_start 2024-08-16 13:17:02.332034   wait Activity/BgWriterMain
pid 24   walwriter                      backend_start 2024-08-16 13:17:02.372296   wait Activity/WalWriterMain
pid 25   autovacuum launcher            backend_start 2024-08-16 13:17:02.373328   wait Activity/AutoVacuumMain
pid 26   archiver                       backend_start 2024-08-16 13:17:02.374639   wait Activity/ArchiverMain
pid 27   logical replication launcher   backend_start 2024-08-16 13:17:02.374752   wait Activity/LogicalLauncherMain   user cloudsqladmin (10)
pid 30   client backend                 backend_start 2024-08-16 13:17:02.523891   wait Client/ClientRead   state idle
         database cloudsqladmin (16384), user cloudsqladmin (10), application
         cloudsqlagent, client port 56364, query_start 2024-08-16 15:28:28.290619,
         state_change 2024-08-16 15:28:28.29065, query ";"
pid 140  client backend                 backend_start 2024-08-16 13:17:06.768402   wait IPC/ParallelFinish   state active
         database iss (16488), user <I hid the username> (16490), application
         PostgreSQL JDBC Driver, client port -1,
         xact_start 2024-08-16 13:27:41.172003, query_start 2024-08-16 13:27:41.439554,
         state_change 2024-08-16 13:27:41.439557, query:

         select
         l1_0.inbound_shipment_id,l1_0.inbound_shipment_line_id,l1_0.ean,lc1_0.id,lc1_0.announcement_type,lc1_0.load_carrier_id,lc1_0.state,l1_0.quantity,l1_0.sku
         from inbound_shipment_lines l1_0 left join load_carriers lc1_0 on
         lc1_0.id=l1_0.load_carrier_id where l1_0.inbound_shipment_id=$1
... (there were other queries)
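
A narrower variant of the query above, filtering on the IPC wait, would look
roughly like this (just a sketch; the column names are those of
pg_stat_activity in PostgreSQL 15):

-- backends currently waiting on IPC, oldest first
select pid, backend_type, state, wait_event_type, wait_event,
       backend_start, xact_start, query_start, query
from pg_stat_activity
where wait_event_type = 'IPC'
order by backend_start asc;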

Using pg_locks, I found that this process (PID 140) was blocking other queries.
Those waits make sense, though (that is the application logic).
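
For reference, the blocking relationship can be seen with something like this
(a sketch using pg_blocking_pids(), not the exact query I ran):

-- lock waiters together with the PIDs that block them
select pid, pg_blocking_pids(pid) as blocked_by,
       wait_event_type, wait_event, state, query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;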

My question:
Why did PID 140 never complete (we waited ~12 hours)? The worst-case scenario
involves few rows (< 20,000). I asked Postgres for a query plan for a similar
query; it comes up with:
Gather  (cost=1015.66..4488.84 rows=1376 width=104) (actual time=6.434..14.233 rows=4 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Nested Loop Left Join  (cost=15.66..3351.24 rows=809 width=104) (actual time=2.583..2.587 rows=2 loops=2)
        ->  Parallel Bitmap Heap Scan on inbound_shipment_lines l1_0  (cost=15.23..1400.35 rows=809 width=62) (actual time=2.574..2.576 rows=2 loops=2)
              Recheck Cond: (inbound_shipment_id = 6134742)
              Heap Blocks: exact=1
              ->  Bitmap Index Scan on inbound_shipment_lines_inbound_shipment_id_idx  (cost=0.00..14.88 rows=1376 width=0) (actual time=3.551..3.551 rows=4 loops=1)
                    Index Cond: (inbound_shipment_id = 6134742)
        ->  Index Scan using load_carriers_pkey on load_carriers lc1_0  (cost=0.43..2.41 rows=1 width=46) (actual time=0.002..0.002 rows=0 loops=4)
              Index Cond: (id = l1_0.load_carrier_id)

This takes very little time to complete (even for the worst-case occurrence).

Although we put (too?) much load on the database, I would not expect the
process to hang. I'd also expect the transaction to be rolled back when the
connection is closed. I tried stopping all clients, but the connections
remained on the database server.
With the 2-core, 4 GB system, database CPU usage was constant at 100%. With 4
cores and 8 GB it was constant around 95%. With 6 cores and 12 GB it was a
little over 60%.
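
Since PID 140 sits in the ParallelFinish wait (the leader waiting for its
parallel workers), a possible workaround would be to disable parallel query for
these sessions while this is investigated (a sketch; max_parallel_workers_per_gather
is a standard setting, the value below is just the obvious one):

-- keep the loader sessions from using parallel plans at all
set max_parallel_workers_per_gather = 0;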

I saw some similar posts
(https://postgrespro.com/list/id/0f64b4c7fc200890f2055ce4d6650e9c2191fac2(dot)camel(at)j-davis(dot)com
for instance, or
https://www.postgresql.org/message-id/E9FA92C2921F31408041863B74EE4C2001A479E590%40CCPMAILDAG03.cantab.local)

I understand the issue is a bit vague, but it's all I've got:
- A process seems to get stuck in an IPC wait under heavy load.
- The connections / transactions were not closed after the client went away.
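
For completeness, the way I would normally clear such a backend and guard
against abandoned sessions looks roughly like this (a sketch; the parameters
exist as named, the values are only examples, and I have not verified whether
terminating a backend stuck in ParallelFinish actually succeeds):

-- forcibly end the stuck backend
select pg_terminate_backend(140);

-- make the server drop sessions whose client has gone away
alter system set idle_in_transaction_session_timeout = '10min';
alter system set client_connection_check_interval = '30s';  -- PostgreSQL 14 and later
select pg_reload_conf();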
