Re: PostgreSQL container crash trouble.

From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: Daisuke Ikeda <dai(dot)ikd123(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: PostgreSQL container crash trouble.
Date: 2021-03-08 15:28:12
Message-ID: 69ceb857cb860ee81c0d221e745c5441655f691e.camel@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, 2021-03-08 at 22:30 +0900, Daisuke Ikeda wrote:
> I'm running PostgreSQL (enabled timescaledb extension) in the following environment.
> And I had db pod crashed error several times irregularly.
>
> I want to know the cause of this fail.
>
> Environment:
>
> - Base: Kubernetes(microk8s)
> - DB: PostgreSQL 12.4 (TimescaleDB) Container
> - DB OS: Alpine Linux
> - microk8s host OS: CentOS 7.6, Amazon Linux 2 (Occured under some hosts)
> - DB data location: Mounted host directory (for data persistence)
>
> 1) PostgreSQL crash and cannot start Pod..
>
> I noticed that the timescaledb pod was restarted repeatedly with the following error.
>
> ---
> PostgreSQL Database directory appears to contain a database; Skipping initialization
> [1] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit
> [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
> [1] LOG: listening on IPv6 address "::", port 5432
> [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
> [20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC
> [20] LOG: invalid primary checkpoint record
> [20] PANIC: could not locate a valid checkpoint record
> [1] LOG: startup process (PID 20) was terminated by signal 6: Aborted
> [1] LOG: aborting startup due to startup process failure
> [1] LOG: database system is shut down
> ---
>
> I thought it might be WAL trouble and tried to check with pg_controldata and pg_waldump.

It is WAL trouble.
WAL does not contain the checkpoint from before the crash.

> I thought it might be WAL trouble and tried to check with pg_controldata and pg_waldump.
>
> ---
> $ pg_controldata <PostgreSQL data dir>
> [...]
> Latest checkpoint location: 15/8FB002C0
> Latest checkpoint's REDO location: 15/8FAF3018
> Latest checkpoint's REDO WAL file: 00000001000000150000008F
> [...]
>
> I checked the latest checkpoint's REDO WAL file.
>
> ---
> $ pg_waldump -n 10 pg_wal/00000001000000150000008F
> pg_waldump: fatal: WAL segment size must be a power of two between 1 MB and 1 GB, but the WAL file "00000001000000150000008F" header specifies 0 bytes
> ---
>
> I cannot read wal data.
> This file size is 16MB (according to the wal size setting).
> But the content is all zero data. I checked this situation with "od" command.
>
> ---
> $ od -N 40 -A d -v pg_wal/00000001000000150000008F
> 0000000 000000 000000 000000 000000 000000 000000 000000 000000
> 0000016 000000 000000 000000 000000 000000 000000 000000 000000
> 0000032 000000 000000 000000 000000
> 0000040
> ---

Looks like modifications to this file were lost.

> 2) Run pg_resetwal and successful start Pod
>
> I ran the pg_resetwal command to repair the WAL trouble.
> And the DB pod successfully started.

Yes, but "pg_resetwal" on a crashed cluster leads to data corruption.
The best you can do now is salvage what you can.

> But, I received select query fail for some tables.
>
> 3) failed select query
>
> ---
> sampledb1=# select * from table1 limit 1;
> ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619
> ---
>
> I thought taht this error is raised for pg_statistic table (and related toast table (pg_toast_2619)).
> So, I deleted the broken rows in pg_statistic table and ran ANALYZE query.
> After then, this tables is recovered.

Lucky you!

> 4) another data trouble
>
> I had an another data trouble after pg_resetwal.
>
> In some table, SELECT query did not return any rows.
> And INSERT query failed (no response. Waiting forever...)
> This table have only primary key index.
> I thought it might have been caused by an pkey index trouble after pg_resetwal.
> I didn't know how to repair the index of the primary key, and finally I dropped table and restore.

You should "pg_dumpall" the cluster and load it into a cluster
that was newly created with "initdb".

Any problems loading the data have to be resolved manually.

> I thought this wal trouble was caused by disk IO troubles. But any error was not raised in OS syslog.
> I want to know any other causes.

One cause might be unreliable storage that doesn't honor fsync requests
correctly. But given that your checkpoint location is pretty late in
the file, it seems unlikely that none of the data did make it to disk.

Somehow you lost the data for a WAL segment, and you cannot recover.

Of course, if you have a backup, you know what to do.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Geoff Winkless 2021-03-08 16:15:51 updating PGDG 12 devel on centos7 requires llvm5.0
Previous Message Daisuke Ikeda 2021-03-08 13:30:04 PostgreSQL container crash trouble.