Re: backend stuck in DataFileExtend

From: Andres Freund <andres(at)anarazel(dot)de>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: backend stuck in DataFileExtend
Date: 2024-05-06 17:51:08
Message-ID: 20240506175108.pcre3ohvlfi2hmdu@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-05-06 12:37:26 -0500, Justin Pryzby wrote:
> On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > > In March, I noticed that a backend got stuck overnight doing:
> > >
> > > backend_start | 2024-03-27 22:34:12.774195-07
> > > xact_start | 2024-03-27 22:34:39.741933-07
> > > query_start | 2024-03-27 22:34:41.997276-07
> > > state_change | 2024-03-27 22:34:41.997307-07
> > > wait_event_type | IO
> > > wait_event | DataFileExtend
> > > state | active
> > > backend_xid | 330896991
> > > backend_xmin | 330896991
> > > query_id | -3255287420265634540
> > > query | PREPARE mml_1 AS INSERT INTO child.un...
> > >
> > > The backend was spinning at 100% CPU:
> > >
> > > [pryzbyj(at)telsa2021 ~]$ ps -O wchan,pcpu 7881
> > > PID WCHAN %CPU S TTY TIME COMMAND
> > > 7881 ? 99.4 R ? 08:14:55 postgres: telsasoft ts [local] INSERT
> > >
> > > This was postgres 16 STABLE compiled at 14e991db8.
> > >
> > > It's possible that this is a rare issue that we haven't hit before.
> > > It's also possible this this is a recent regression. We previously
> > > compiled at b2c9936a7 without hitting any issue (possibly by chance).
> > >
> > > I could neither strace the process nor attach a debugger. They got
> > > stuck attaching. Maybe it's possible there's a kernel issue. This is a
> > > VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.
> >
> > > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > > 229 3088448
> > >
> > > When I tried to shut down postgres (hoping to finally be able to attach
> > > a debugger), instead it got stuck:
> >
> > That strongly indicates either a kernel bug or storage having an issue. It
> > can't be postgres' fault if an IO never completes.
>
> Is that for sure even though wchan=? (which I take to mean "not in a system
> call"), and the process is stuck in user mode ?

Postgres doesn't do anything to prevent a debugger from working, so this is
just indicative that the kernel is stuck somewhere that it didn't set up
information about being blocked - because it's busy doing something.

> > What do /proc/$pid/stack say?
>
> [pryzbyj(at)telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
> 18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 17 3 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0

stack, not stat...

> > > Full disclosure: the VM that hit this issue today has had storage-level
> > > errors (reported here at ZZqr_GTaHyuW7fLp(at)pryzbyj2023), as recently as 3
> > > days ago.
> >
> > So indeed, my suspicion from above is confirmed.
>
> I'd be fine with that conclusion (as in the earlier thread), except this
> has now happened on 2 different VMs, and the first one has no I/O
> issues. If this were another symptom of a storage failure, and hadn't
> previously happened on another VM, I wouldn't be re-reporting it.

Is it the same VM hosting environment? And the same old distro?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-05-06 17:51:30 Re: On disable_cost
Previous Message Jonathan S. Katz 2024-05-06 17:44:05 2024-05-09 release announcement draft