soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]

From: Matthias Apitz <guru(at)unixarea(dot)de>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]
Date: 2024-03-22 17:12:28
Message-ID: Zf27/PpLpsE0fB+i@pureos
Lists: pgsql-general


We have had a PostgreSQL 15.1 server in production at a customer site for
some weeks (migrated from an older version), running on SUSE SLES 15.

The customer is facing machine lockups. Shortly before the Linux server
stops responding entirely (not even to SSH; only a power-cycle reset
brings it back), a lot of messages like the following appear in
/var/log/messages:

# grep watchdog: /var/log/messages
...
2024-03-22T13:11:32.056154+01:00 sunrise kernel: [327844.313048][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3069s! [migration/25:166]
2024-03-22T13:12:28.056244+01:00 sunrise kernel: [327900.310267][ C16] watchdog: BUG: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]
2024-03-22T13:12:28.056340+01:00 sunrise kernel: [327900.311052][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3121s! [migration/25:166]
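
The watchdog lines above can be condensed into one entry per stuck CPU and task, which makes it easier to see which processes are affected and for how long. The following is only a minimal sketch: the regular expression is derived from the three lines quoted above, and the names `LOCKUP_RE` and `summarize_lockups` are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern derived from the kernel lines quoted above (an assumption, not a
# guaranteed format): "... soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Return {(cpu, task, pid): longest reported stuck time in seconds}."""
    worst = defaultdict(int)
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m is None:
            continue  # not a soft-lockup line
        key = (int(m["cpu"]), m["task"], int(m["pid"]))
        worst[key] = max(worst[key], int(m["secs"]))
    return dict(worst)


# The three lines from the grep output above, used as sample input.
SAMPLE = [
    "2024-03-22T13:11:32.056154+01:00 sunrise kernel: [327844.313048][ C25] "
    "watchdog: BUG: soft lockup - CPU#25 stuck for 3069s! [migration/25:166]",
    "2024-03-22T13:12:28.056244+01:00 sunrise kernel: [327900.310267][ C16] "
    "watchdog: BUG: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]",
    "2024-03-22T13:12:28.056340+01:00 sunrise kernel: [327900.311052][ C25] "
    "watchdog: BUG: soft lockup - CPU#25 stuck for 3121s! [migration/25:166]",
]

if __name__ == "__main__":
    for (cpu, task, pid), secs in sorted(summarize_lockups(SAMPLE).items()):
        print(f"CPU#{cpu} {task}[{pid}] stuck for up to {secs}s")
```

To run it against the real log, the lines of /var/log/messages could be passed in place of `SAMPLE`.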

Not all of them are related to postmaster, but some are. The server is
essentially idle, and has a lot of CPUs and 32 GByte of memory. Around 100
PostgreSQL clients connect to the server, most of them via ESQL/C and
on localhost.

Looking around, I discovered today that WAL archiving was configured
incorrectly, leading to messages like these (the server logs in German;
shown here in English):

2024-03-22 13:11:50.838 CET [2630] LOG: archive command failed with exit code 1
2024-03-22 13:11:50.838 CET [2630] DETAIL: The failed archive command was: test ! -f /data/postgresql151/wal_archive/000000010000000000000001 && cp pg_wal/000000010000000000000001 /data/postgresql151/wal_archive/000000010000000000000001
cp: cannot create regular file '/data/postgresql151/wal_archive/000000010000000000000001': No such file or directory
2024-03-22 13:11:51.842 CET [2630] LOG: archive command failed with exit code 1
2024-03-22 13:11:51.842 CET [2630] DETAIL: The failed archive command was: test ! -f /data/postgresql151/wal_archive/000000010000000000000001 && cp pg_wal/000000010000000000000001 /data/postgresql151/wal_archive/000000010000000000000001

The problem was that the directory /data/postgresql151/wal_archive had
simply never been created (and this went unnoticed for two weeks in
production). Now that it has been created, and a backup of the WAL files
from there is also in place, the lockups have gone away.
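
A missing archive directory like this one can be caught with a simple pre-flight check before archiving is enabled. The following is only a rough sketch using the Python standard library; the function name `archive_dir_problems` is hypothetical, and the check would have to run as the operating-system user that owns the PostgreSQL server for the permission test to be meaningful.

```python
import os


def archive_dir_problems(path):
    """Return a list of human-readable problems with a WAL archive directory.

    An empty list means the directory exists and the current user may
    create files in it. Illustrative helper, not part of PostgreSQL.
    """
    problems = []
    if not os.path.exists(path):
        problems.append(f"{path} does not exist")
    elif not os.path.isdir(path):
        problems.append(f"{path} is not a directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


if __name__ == "__main__":
    # Path taken from the archive command quoted in the log above.
    for problem in archive_dir_problems("/data/postgresql151/wal_archive"):
        print("WAL archive check failed:", problem)
```

On a running server, the `failed_count` and `last_failed_wal` columns of the `pg_stat_archiver` view report the same kind of failure from the database side.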

Is there any chance that the PostgreSQL server's inability to copy the
WAL files caused the lockups? Just to make sure that we have really found
the culprit.

matthias

--
Matthias Apitz, ✉ guru(at)unixarea(dot)de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
