PostgreSQL container crash trouble.

From: Daisuke Ikeda <dai(dot)ikd123(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: PostgreSQL container crash trouble.
Date: 2021-03-08 13:30:04
Message-ID: CAJ6DU3HiJ5FHbqPua19jAD=wLgiXBTjuHfbmv1jCOaNOpB3cCQ@mail.gmail.com
Lists: pgsql-general

Dear all.

I'm running PostgreSQL (with the timescaledb extension enabled) in the following
environment, and the DB pod has crashed several times at irregular intervals.

I would like to know the cause of this failure.

Environment:

- Base: Kubernetes (microk8s)
- DB: PostgreSQL 12.4 (TimescaleDB) container
- DB OS: Alpine Linux
- microk8s host OS: CentOS 7.6, Amazon Linux 2 (occurred on several hosts)
- DB data location: mounted host directory (for data persistence)

1) PostgreSQL crashed and the Pod cannot start

I noticed that the timescaledb pod was restarted repeatedly with the
following error.

---
PostgreSQL Database directory appears to contain a database; Skipping
initialization
[1] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-musl, compiled by gcc
(Alpine 9.3.0) 9.3.0, 64-bit
[1] LOG: listening on IPv4 address "0.0.0.0", port 5432
[1] LOG: listening on IPv6 address "::", port 5432
[1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
[20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC
[20] LOG: invalid primary checkpoint record
[20] PANIC: could not locate a valid checkpoint record
[1] LOG: startup process (PID 20) was terminated by signal 6: Aborted
[1] LOG: aborting startup due to startup process failure
[1] LOG: database system is shut down
---

I thought it might be WAL trouble and tried to check with pg_controldata
and pg_waldump.

---
$ pg_controldata <PostgreSQL data dir>
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6909006008117780509
Database cluster state: in production
pg_control last modified: Mon Feb 15 21:11:21 2021
Latest checkpoint location: 15/8FB002C0
Latest checkpoint's REDO location: 15/8FAF3018
Latest checkpoint's REDO WAL file: 00000001000000150000008F
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:33988723
Latest checkpoint's NextOID: 117637
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 480
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 33988723
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Mon Feb 15 21:11:14 2021
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: off
max_connections setting: 50
max_worker_processes setting: 15
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce:
871e705d9393cdb1b161d5562656e20db314779af25942fa10d840983dc4ba84
---
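
For reference, both the latest checkpoint location (15/8FB002C0) and its REDO
location (15/8FAF3018) fall inside the segment 00000001000000150000008F listed
above: with 16MB segments, the last eight hex digits of the segment name are
just the low 32 bits of the LSN divided by the segment size.

---
$ printf '%08X\n' $(( 0x8FB002C0 / (16 * 1024 * 1024) ))   # latest checkpoint
0000008F
$ printf '%08X\n' $(( 0x8FAF3018 / (16 * 1024 * 1024) ))   # REDO location
0000008F
---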

I checked the latest checkpoint's REDO WAL file.

---
$ pg_waldump -n 10 pg_wal/00000001000000150000008F
pg_waldump: fatal: WAL segment size must be a power of two between 1 MB and
1 GB, but the WAL file "00000001000000150000008F" header specifies 0 bytes
---

So I cannot read the WAL data. The file size is 16MB (matching the WAL segment
size setting), but the content is all zeros, which I confirmed with the "od"
command.

---
$ od -N 40 -A d -v pg_wal/00000001000000150000008F
0000000 000000 000000 000000 000000 000000 000000 000000 000000
0000016 000000 000000 000000 000000 000000 000000 000000 000000
0000032 000000 000000 000000 000000
0000040
---
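
To confirm that the whole 16MB segment is zero-filled (not only the first 40
bytes shown above), a quick check like the following can be used; it prints 0
when every byte in the file is NUL.

---
$ tr -d '\0' < pg_wal/00000001000000150000008F | wc -c
---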

The previous WAL segment could still be read with pg_waldump. However, I could
not find any record corresponding to the latest checkpoint location
(15/8FB002C0) reported by pg_controldata.
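
(For reference, this is roughly the kind of check I mean; checkpoint records
appear in pg_waldump output under the XLOG resource manager, so I grepped the
previous segment for them. The exact command is reconstructed from memory.)

---
$ pg_waldump pg_wal/00000001000000150000008E | grep -i checkpoint
---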

2) Ran pg_resetwal and the Pod started successfully

I ran pg_resetwal to repair the WAL, and the DB pod started successfully.

However, SELECT queries then failed for some tables.
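
For reference, this was roughly the invocation (reconstructed from memory; I
believe -f was required because the cluster state was still "in production"):

---
$ pg_resetwal -n <PostgreSQL data dir>   # dry run, only prints the new values
$ pg_resetwal -f <PostgreSQL data dir>
---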

3) Failed SELECT query

---
sampledb1=# select * from table1 limit 1;
ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619
---

I figured this error was raised for the pg_statistic table (and its associated
TOAST table, pg_toast_2619). So I deleted the broken rows from pg_statistic and
ran ANALYZE, and after that the table was recovered.
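
For reference, this is roughly how I located and removed the broken rows
(reconstructed from memory, so the exact statements may differ; the DO block
simply forces detoasting of every pg_statistic row to find the ones that fail):

---
-- pg_toast_2619 belongs to pg_statistic (TOAST tables are named
-- pg_toast_<owning table OID>)
SELECT 2619::regclass;

-- force detoasting of each pg_statistic row; report the rows that error out
DO $$
DECLARE
  rec record;
BEGIN
  FOR rec IN SELECT ctid, starelid, staattnum FROM pg_statistic LOOP
    BEGIN
      PERFORM s::text FROM pg_statistic AS s WHERE s.ctid = rec.ctid;
    EXCEPTION WHEN OTHERS THEN
      RAISE NOTICE 'broken row: ctid=%, starelid=%, staattnum=%',
                   rec.ctid, rec.starelid, rec.staattnum;
    END;
  END LOOP;
END $$;

-- as superuser, delete the reported rows (the ctid below is only an example
-- value) and rebuild the statistics
DELETE FROM pg_statistic WHERE ctid = '(0,1)';
ANALYZE;
---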

4) Another data problem

I had another data problem after pg_resetwal.

On one table, SELECT queries did not return any rows, and INSERT queries hung
with no response (waiting forever). This table has only a primary key index, so
I suspected the primary key index was corrupted by pg_resetwal. I did not know
how to repair the primary key index, so in the end I dropped the table and
restored it.
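
If there is a way to rebuild just the broken primary key index instead of
dropping the whole table (something like the statement below, perhaps?), I
would like to know whether that is the right approach.

---
sampledb1=# REINDEX TABLE <broken table>;
---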

I suspected that this WAL trouble was caused by disk I/O problems, but no
errors appeared in the OS syslog. I would like to know what other possible
causes there are.

Best regards.
