From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: backend stuck in DataFileExtend
Date: 2024-05-07 18:54:16
Message-ID: Zjp42NK4abklf6t5@pryzbyj2023
Lists: pgsql-hackers
On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> On Tue, May 7, 2024 at 6:21 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > FWIW: both are running zfs-2.2.3 RPMs from zfsonlinux.org.
> ...
> > Yes, they're running centos7 with the indicated kernels.
>
> So far we've got:
>
> * spurious EIO when opening a file (your previous report)
> * hanging with CPU spinning (?) inside pwritev()
> * old kernel, bleeding edge ZFS
>
> From an (uninformed) peek at the ZFS code, if it really is spinning
> there it seems like a pretty low-level problem: it has finished the write,
> and now is just trying to release (something like our unpin) and
> unlock the buffers, which involves various code paths that might touch
> various mutexes and spinlocks, and to get stuck like that I guess it's
> either corrupted itself or it is deadlocking against something else,
> but what? Do you see any other processes (including kernel threads)
> with any stuck stacks that might be a deadlock partner?
Sorry, but after forgetting several times, by the time I finally
remembered to go back to the issue, I had already rebooted the VM, which
was needed to kill the stuck process.
But it seems to have recurred again this AM. Note that yesterday,
I'd taken the opportunity to upgrade to zfs-2.2.4.
These two procs are the oldest active postgres procs, and also (now)
adjacent in ps -ef --sort start_time.
-[ RECORD 1 ]----+----------------------------------------------------------------
backend_start | 2024-05-07 09:45:06.228528-06
application_name |
xact_start | 2024-05-07 09:55:38.409549-06
query_start | 2024-05-07 09:55:38.409549-06
state_change | 2024-05-07 09:55:38.409549-06
pid | 27449
backend_type | autovacuum worker
wait_event_type | BufferPin
wait_event | BufferPin
state | active
backend_xid |
backend_xmin | 4293757489
query_id |
left | autovacuum: VACUUM ANALYZE child.
-[ RECORD 2 ]----+----------------------------------------------------------------
backend_start | 2024-05-07 09:49:24.686314-06
application_name | MasterMetricsLoader -n -m Xml
xact_start | 2024-05-07 09:50:30.387156-06
query_start | 2024-05-07 09:50:32.744435-06
state_change | 2024-05-07 09:50:32.744436-06
pid | 5051
backend_type | client backend
wait_event_type | IO
wait_event | DataFileExtend
state | active
backend_xid | 4293757489
backend_xmin | 4293757429
query_id | 2230046159014513529
left | PREPARE mml_0 AS INSERT INTO chil
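For reference, the two records above came from a query roughly like this
(a reconstruction from memory; the exact column list and the left()
truncation width are guesses, not the literal command I ran):

psql ts -x -c "
  SELECT backend_start, application_name, xact_start, query_start,
         state_change, pid, backend_type, wait_event_type, wait_event,
         state, backend_xid, backend_xmin, query_id, left(query, 33)
  FROM pg_stat_activity
  WHERE pid IN (27449, 5051);"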
The earlier proc is doing:
strace: Process 27449 attached
epoll_wait(11, ^Cstrace: Process 27449 detached
<detached ...>
The later process is stuck, with:
[pryzbyj(at)telsasoft-centos7 ~]$ sudo cat /proc/5051/stack
[<ffffffffffffffff>] 0xffffffffffffffff
For good measure:
[pryzbyj(at)telsasoft-centos7 ~]$ sudo cat /proc/27433/stack
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[pryzbyj(at)telsasoft-centos7 ~]$ sudo cat /proc/27434/stack
[<ffffffffc0600c2e>] taskq_thread+0x48e/0x4e0 [spl]
[<ffffffff9eec5f91>] kthread+0xd1/0xe0
[<ffffffff9f599df7>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[pryzbyj(at)telsasoft-centos7 ~]$ ps -O wchan================ 5051 27449
PID =============== S TTY TIME COMMAND
5051 ? R ? 02:14:27 postgres: telsasoft ts ::1(53708) INSERT
27449 ep_poll S ? 00:05:16 postgres: autovacuum worker ts
The interesting procs might be:
ps -eO wchan===============,lstart --sort start_time
...
15632 worker_thread Mon May 6 23:51:34 2024 S ? 00:00:00 [kworker/2:2H]
27433 taskq_thread Tue May 7 09:35:59 2024 S ? 00:00:56 [z_wr_iss]
27434 taskq_thread Tue May 7 09:35:59 2024 S ? 00:00:57 [z_wr_iss]
27449 ep_poll Tue May 7 09:45:05 2024 S ? 00:05:16 postgres: autovacuum worker ts
5051 ? Tue May 7 09:49:23 2024 R ? 02:23:04 postgres: telsasoft ts ::1(53708) INSERT
7861 ep_poll Tue May 7 09:51:25 2024 S ? 00:03:04 /usr/local/bin/python3.8 -u /home/telsasoft/server/alarms/core/pr...
7912 ep_poll Tue May 7 09:51:27 2024 S ? 00:00:00 postgres: telsasoft ts ::1(53794) idle
24518 futex_wait_que Tue May 7 10:42:56 2024 S ? 00:00:55 java -jar /home/telsasoft/server/alarms/alcatel_lucent/jms/jms2rm...
...
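Re: your question about a deadlock partner: if this recurs, something
like the following could grab the kernel stacks of every task in
uninterruptible sleep plus the ZFS/SPL taskq threads in one shot, rather
than spot-checking pids by hand. Just a sketch (untested here; the z_*
name filter is a guess at which taskq threads are interesting):

# dump kernel stacks of D-state tasks and ZFS taskq threads (as root)
for pid in $(ps -eo pid=,stat=,comm= |
             awk '$2 ~ /^D/ || $3 ~ /^z_/ {print $1}'); do
    echo "=== pid $pid ($(cat /proc/$pid/comm 2>/dev/null)) ==="
    cat /proc/$pid/stack 2>/dev/null
done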
> While looking around for reported issues I found your abandoned report
> against an older ZFS version from a few years ago, same old Linux
> version:
>
> https://github.com/openzfs/zfs/issues/11641
>
> I don't know enough to say anything useful about that but it certainly
> smells similar...
Wow - I'd completely forgotten about that problem report.
The symptoms are the same, even with a zfs version 3+ years newer.
I wish the ZFS people would do more with their problem reports.
BTW, we'll be upgrading this VM to a newer kernel, if not a newer OS
(for some reason, these projects take a very long time). With any luck,
it'll either recur, or not.
I'm not sure if any of that is useful, or interesting.
--
Justin