backend stuck in DataFileExtend

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Andres Freund <andres(at)anarazel(dot)de>
Subject: backend stuck in DataFileExtend
Date: 2024-05-06 14:05:38
Message-ID: ZjjjssKamHyWpxLT@pryzbyj2023
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

In March, I noticed that a backend got stuck overnight doing:

backend_start | 2024-03-27 22:34:12.774195-07
xact_start | 2024-03-27 22:34:39.741933-07
query_start | 2024-03-27 22:34:41.997276-07
state_change | 2024-03-27 22:34:41.997307-07
wait_event_type | IO
wait_event | DataFileExtend
state | active
backend_xid | 330896991
backend_xmin | 330896991
query_id | -3255287420265634540
query | PREPARE mml_1 AS INSERT INTO child.un...

The backend was spinning at 100% CPU:

[pryzbyj(at)telsa2021 ~]$ ps -O wchan,pcpu 7881
PID WCHAN %CPU S TTY TIME COMMAND
7881 ? 99.4 R ? 08:14:55 postgres: telsasoft ts [local] INSERT

This was postgres 16 STABLE compiled at 14e991db8.

It's possible that this is a rare issue that we haven't hit before.
It's also possible this this is a recent regression. We previously
compiled at b2c9936a7 without hitting any issue (possibly by chance).

I could neither strace the process nor attach a debugger. They got
stuck attaching. Maybe it's possible there's a kernel issue. This is a
VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.

$ awk '{print $14, $15}' /proc/7881/stat # usr / sys
229 3088448

When I tried to shut down postgres (hoping to finally be able to attach
a debugger), instead it got stuck:

$ ps -fu postgres
UID PID PPID C STIME TTY TIME CMD
postgres 7881 119674 99 mar27 ? 08:38:06 postgres: telsasoft ts [local] INSERT
postgres 119674 1 0 mar25 ? 00:07:13 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
postgres 119676 119674 0 mar25 ? 00:00:11 postgres: logger
postgres 119679 119674 0 mar25 ? 00:11:56 postgres: checkpointer

This first occurred on Mar 27, but I see today that it's now recurring
at a different customer:

backend_start | 2024-05-05 22:19:17.009477-06
xact_start | 2024-05-05 22:20:18.129305-06
query_start | 2024-05-05 22:20:19.409555-06
state_change | 2024-05-05 22:20:19.409556-06
pid | 18468
wait_event_type | IO
wait_event | DataFileExtend
state | active
backend_xid | 4236249136
backend_xmin | 4236221661
query_id | 2601062835886299431
left | PREPARE mml_1 AS INSERT INTO chil

This time it's running v16.2 (REL_16_STABLE compiled at b78fa8547),
under centos7 / 3.10.0-1160.66.1.el7.x86_64.

The symptoms are the same: backend stuck using 100% CPU in user mode:
[pryzbyj(at)telsasoft-centos7 ~]$ awk '{print $14, $15}' /proc/18468/stat # usr / sys
364 2541168

This seems to have affected other backends, which are now waiting on
RegisterSyncRequest, frozenid, and CheckpointerComm.

In both cases, the backend got stuck after 10pm, which is when a backup
job kicks off, followed by other DB maintenance. Our backup job uses
pg_export_snapshot() + pg_dump --snapshot. In today's case, the pg_dump
would've finished and snapshot closed at 2023-05-05 22:15. The backup
job did some more pg_dumps involving historic tables without snapshots
and finished at 01:11:40, at which point a reindex job started, but it
looks like the DB was already stuck for the purpose of reindex, and so
the script ended after a handful of commands were "[canceled] due to
statement timeout".

Full disclosure: the VM that hit this issue today has had storage-level
errors (reported here at ZZqr_GTaHyuW7fLp(at)pryzbyj2023), as recently as 3
days ago.

Maybe more importantly, this VM also seems to suffer from some memory
leak, and the leaky process was Killed shortly after the stuck query
started. This makes me suspect a race condition which was triggered
while swapping:

May 5 22:24:05 localhost kernel: Out of memory: Kill process 17157 (python3.8) score 134 or sacrifice child

We don't have as good logs from March, but I'm not aware of killed
processes on the VM where we hit this in March, but it's true that the
I/O there is not fast.

Also, I fibbed when I said these were compiled at 16 STABLE - I'd
backpatched a small number of patches from master:

a97bbe1f1df Reduce branches in heapgetpage()'s per-tuple loop
98f320eb2ef Increase default vacuum_buffer_usage_limit to 2MB.
44086b09753 Preliminary refactor of heap scanning functions
959b38d770b Invent --transaction-size option for pg_restore.
a45c78e3284 Rearrange pg_dump's handling of large objects for better efficiency.
9d1a5354f58 Fix costing bug in MergeAppend
a5cf808be55 Read include/exclude commands for dump/restore from file
8c16ad3b432 Allow using syncfs() in frontend utilities.
cccc6cdeb32 Add support for syncfs() in frontend support functions.
3ed19567198 Make enum for sync methods available to frontend code.
f39b265808b Move PG_TEMP_FILE* macros to file_utils.h.
a14354cac0e Add GUC parameter "huge_pages_status"

I will need to restart services this morning, but if someone wants to
suggest diagnostic measures, I will see whether the command gets stuck
or not.

--
Justin

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sriram RK 2024-05-06 14:12:22 Re: AIX support
Previous Message Dmitry Dolgov 2024-05-06 14:01:33 Re: partitioning and identity column