Quick Links

Re: backend stuck in DataFileExtend

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: backend stuck in DataFileExtend
Date:	2024-05-06 17:04:13
Message-ID:	20240506170413.u7rybdg7nmgrizfu@awork3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> In March, I noticed that a backend got stuck overnight doing:
>
> backend_start | 2024-03-27 22:34:12.774195-07
> xact_start | 2024-03-27 22:34:39.741933-07
> query_start | 2024-03-27 22:34:41.997276-07
> state_change | 2024-03-27 22:34:41.997307-07
> wait_event_type | IO
> wait_event | DataFileExtend
> state | active
> backend_xid | 330896991
> backend_xmin | 330896991
> query_id | -3255287420265634540
> query | PREPARE mml_1 AS INSERT INTO child.un...
>
> The backend was spinning at 100% CPU:
>
> [pryzbyj(at)telsa2021 ~]$ ps -O wchan,pcpu 7881
> PID WCHAN %CPU S TTY TIME COMMAND
> 7881 ? 99.4 R ? 08:14:55 postgres: telsasoft ts [local] INSERT
>
> This was postgres 16 STABLE compiled at 14e991db8.
>
> It's possible that this is a rare issue that we haven't hit before.
> It's also possible this this is a recent regression. We previously
> compiled at b2c9936a7 without hitting any issue (possibly by chance).
>
> I could neither strace the process nor attach a debugger. They got
> stuck attaching. Maybe it's possible there's a kernel issue. This is a
> VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.

> $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> 229 3088448
>
> When I tried to shut down postgres (hoping to finally be able to attach
> a debugger), instead it got stuck:

That strongly indicates either a kernel bug or storage having an issue. It
can't be postgres' fault if an IO never completes.

What do /proc/$pid/stack say?

> In both cases, the backend got stuck after 10pm, which is when a backup
> job kicks off, followed by other DB maintenance. Our backup job uses
> pg_export_snapshot() + pg_dump --snapshot. In today's case, the pg_dump
> would've finished and snapshot closed at 2023-05-05 22:15. The backup
> job did some more pg_dumps involving historic tables without snapshots
> and finished at 01:11:40, at which point a reindex job started, but it
> looks like the DB was already stuck for the purpose of reindex, and so
> the script ended after a handful of commands were "[canceled] due to
> statement timeout".

Is it possible that you're "just" waiting for very slow IO? Is there a lot of
dirty memory? Particularly on these old kernels that can lead to very extreme
delays.

grep -Ei 'dirty|writeback' /proc/meminfo

> [...]
> Full disclosure: the VM that hit this issue today has had storage-level
> errors (reported here at ZZqr_GTaHyuW7fLp(at)pryzbyj2023), as recently as 3
> days ago.

So indeed, my suspicion from above is confirmed.

Greetings,

Andres Freund

In response to

backend stuck in DataFileExtend at 2024-05-06 14:05:38 from Justin Pryzby

Responses

Re: backend stuck in DataFileExtend at 2024-05-06 17:37:26 from Justin Pryzby

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bruce Momjian	2024-05-06 17:08:35	Re: Removing unneeded self joins
Previous Message	Justin Pryzby	2024-05-06 16:41:22	Re: pg17 issues with not-null contraints