From: | Ian J Cottee <ian(at)cottee(dot)org> |
---|---|
To: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS |
Date: | 2024-10-30 07:34:03 |
Message-ID: | CAL0m=zXXmKKdh0zNseVmMZ2qfWH-093sToyKgOGCjiPWikz3Xg@mail.gmail.com |
Lists: | pgsql-general |
Hello everyone, I’ve been using postgres for over 25 years now and never
had any major issues which were not caused by my own stupidity. In the last
24 hours however I’ve had a number of issues on one client's server which I
assume are a bug in postgres or a possible hardware issue (they are running
on a Linode) but I need some clarification and would welcome advice on how
to proceed. I will also forward this mail to Linode support to ask them to
check for any memory issues they can detect.
This particular Postgres instance is running on Ubuntu 22.04 LTS and has the
following version information:
```
PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
```
The quick summary is that over a 24 hour period the following errors
appeared in the postgres logs at different times, causing the server
processes to restart:
- stuck spinlock detected
- free(): corrupted unsorted chunks
- double free or corruption (!prev)
- corrupted size vs. prev_size
- corrupted double-linked list
- *** stack smashing detected ***: terminated
- Segmentation fault
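For anyone wanting to check their own logs for the same signatures, here is a minimal sketch that counts how often each message in the list above appears. The function name `count_signatures` is hypothetical, and it assumes the log is plain text with one message per line; the glibc messages are printed without a timestamp, so they are matched as plain substrings.

```python
from collections import Counter

# Signatures copied from the list above. The glibc abort messages carry no
# timestamp prefix, so a plain substring match is used for every entry.
SIGNATURES = [
    "stuck spinlock detected",
    "free(): corrupted unsorted chunks",
    "double free or corruption",
    "corrupted size vs. prev_size",
    "corrupted double-linked list",
    "stack smashing detected",
    "Segmentation fault",
]


def count_signatures(lines):
    """Return a Counter mapping each signature to the number of lines containing it."""
    counts = Counter()
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    return counts
```

Passing an open file object works as well as a list of strings, for example `count_signatures(open(path, errors="replace"))`, where `path` is wherever the postgres log lives on the machine in question.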
Here’s the more detailed breakdown.
On Monday evening this week, the following event occurred on the server:
```
2024-10-28 18:12:47.145 GMT [575437] xxx(at)xxx PANIC: stuck spinlock detected
at LWLockWaitListLock, ./build/../src/backend/storage/lmgr/lwlock.c:913
```
Followed by:
```
2024-10-28 18:12:47.249 GMT [1880289] LOG: terminating any other active
server processes
2024-10-28 18:12:47.284 GMT [1880289] LOG: all server processes terminated;
reinitializing
```
And eventually
```
2024-10-28 18:12:48.474 GMT [575566] xxx(at)xxx FATAL: the database system is
in recovery mode
2024-10-28 18:12:48.476 GMT [575550] LOG: database system was not properly
shut down; automatic recovery in progress
2024-10-28 18:12:48.487 GMT [575550] LOG: redo starts at DD/405E83A8
2024-10-28 18:12:48.487 GMT [575550] LOG: invalid record length at
DD/405EF818: wanted 24, got 0
2024-10-28 18:12:48.487 GMT [575550] LOG: redo done at DD/405EF7E0 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-10-28 18:12:48.515 GMT [1880289] LOG: database system is ready to
accept connections
```
This wasn’t noticed by me or by any users, as they tend to all be finished
by 17:30. However, later that evening:
```
2024-10-28 20:27:15.258 GMT [611459] xxx(at)xxx LOG: unexpected EOF on client
connection with an open transaction
2024-10-28 21:01:05.934 GMT [620373] xxx(at)xxxx LOG: unexpected EOF on client
connection with an open transaction
free(): corrupted unsorted chunks
2024-10-28 21:15:02.203 GMT [1880289] LOG: server process (PID 623803) was
terminated by signal 6: Aborted
2024-10-28 21:15:02.204 GMT [1880289] LOG: terminating any other active
server processes
```
This time it could not recover and I didn’t notice until early the next
morning whilst doing some routine checks.
```
2024-10-28 21:15:03.643 GMT [623807] LOG: database system was not properly
shut down; automatic recovery in progress
2024-10-28 21:15:03.655 GMT [623807] LOG: redo starts at DD/47366740
2024-10-28 21:15:03.663 GMT [623807] LOG: invalid record length at
DD/475452A0: wanted 24, got 0
2024-10-28 21:15:03.663 GMT [623807] LOG: redo done at DD/47545268 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-10-28 21:15:03.682 GMT [623829] xxx(at)xxx FATAL: the database system is
in recovery mode
double free or corruption (!prev)
2024-10-28 21:15:03.832 GMT [1880289] LOG: startup process (PID 623807) was
terminated by signal 6: Aborted
2024-10-28 21:15:03.832 GMT [1880289] LOG: aborting startup due to startup
process failure
2024-10-28 21:15:03.835 GMT [1880289] LOG: database system is shut down
```
When I noticed in the morning, it was able to start without an issue. From
googling, it appeared to be a memory issue, and I wondered whether the
problem had been sorted now that the server process had stopped completely
and restarted. It had not, although the errors below were all recovered from
automatically, without any input from me and without the client noticing.
```
corrupted size vs. prev_size
2024-10-29 09:55:24.417 GMT [894747] LOG: background worker "parallel
worker" (PID 947642) was terminated by signal 6: Aborted
```
```
corrupted double-linked list
2024-10-29 13:14:28.322 GMT [894747] LOG: background worker "parallel
worker" (PID 1019071) was terminated by signal 6: Aborted
```
```
*** stack smashing detected ***: terminated
2024-10-28 15:24:30.331 GMT [1880289] LOG: background worker "parallel
worker" (PID 528630) was terminated by signal 6: Aborted
```
```
2024-10-28 15:40:26.617 GMT [1880289] LOG: background worker "parallel
worker" (PID 533515) was terminated by signal 11: Segmentation fault
2024-10-28 15:40:26.617 GMT [1880289] DETAIL: Failed process was running:
SELECT "formula_line".id FROM "formul\
```
I rebooted the server at 18:30 and have had no further issues so far,
although work has yet to start. When rebooting the server, postgres seemed
to take a long time to terminate.
Now there is one odd thing that has been happening recently. Due to a bug
in my code I've had more deadlock-style concurrency errors than would
normally be expected, for example:
```
2024-10-29 19:26:51.680 GMT [71152] xxx(at)xxx ERROR: could not serialize
access due to concurrent update
```
I believe I have fixed that bug in my code this morning, and these
concurrency errors did not seem to coincide with the memory errors above,
but I'm raising it in case it is related.
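The "did not seem to coincide" check can be made a little more rigorous. This is a rough sketch, assuming an unwrapped log file in which every server message starts with a `YYYY-MM-DD HH:MM:SS.mmm` timestamp as in the excerpts above. The helper names and the five minute window are my own choices, not anything postgres provides.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the log excerpts in this mail.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(line):
    """Parse the leading 23-character timestamp of a log line, or return None."""
    try:
        return datetime.strptime(line[:23], TS_FORMAT)
    except ValueError:
        return None


def events(lines, needle):
    """Return timestamps of all timestamped lines that contain `needle`."""
    found = []
    for line in lines:
        if needle in line:
            ts = parse_ts(line)
            if ts is not None:
                found.append(ts)
    return found


def crashes_near(crashes, others, window=timedelta(minutes=5)):
    """Return crash timestamps with at least one `others` event within `window`."""
    return [c for c in crashes if any(abs(c - o) <= window for o in others)]
```

Calling `events(lines, "was terminated by signal")` and `events(lines, "could not serialize access")` and passing both lists to `crashes_near` gives the crashes that had a serialization error close by. An empty result would support the impression that the two are unrelated, though it would not prove it.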
Comments and insights are warmly welcomed.
Best regards
Ian Cottee