Re: PostgreSQL container crash trouble.

From: Daisuke Ikeda <dai(dot)ikd123(at)gmail(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: PostgreSQL container crash trouble.
Date: 2021-03-09 13:21:12
Message-ID: CAJ6DU3EVjGp2GFCdzap635_o6=nRVZDYF2PG9aWM4ubZqhdD1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thank you for your reply!
It was helpful.

I'll check more about the state of writing on the disk storage.

And I will reinforce the backup & restore strategy.

2021年3月9日(火) 0:28 Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>:

> On Mon, 2021-03-08 at 22:30 +0900, Daisuke Ikeda wrote:
> > I'm running PostgreSQL (enabled timescaledb extension) in the following
> environment.
> > And I had db pod crashed error several times irregularly.
> >
> > I want to know the cause of this fail.
> >
> > Environment:
> >
> > - Base: Kubernetes(microk8s)
> > - DB: PostgreSQL 12.4 (TimescaleDB) Container
> > - DB OS: Alpine Linux
> > - microk8s host OS: CentOS 7.6, Amazon Linux 2 (Occured under some
> hosts)
> > - DB data location: Mounted host directory (for data persistence)
> >
> > 1) PostgreSQL crash and cannot start Pod..
> >
> > I noticed that the timescaledb pod was restarted repeatedly with the
> following error.
> >
> > ---
> > PostgreSQL Database directory appears to contain a database; Skipping
> initialization
> > [1] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-musl, compiled by
> gcc (Alpine 9.3.0) 9.3.0, 64-bit
> > [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
> > [1] LOG: listening on IPv6 address "::", port 5432
> > [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
> > [20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC
> > [20] LOG: invalid primary checkpoint record
> > [20] PANIC: could not locate a valid checkpoint record
> > [1] LOG: startup process (PID 20) was terminated by signal 6: Aborted
> > [1] LOG: aborting startup due to startup process failure
> > [1] LOG: database system is shut down
> > ---
> >
> > I thought it might be WAL trouble and tried to check with pg_controldata
> and pg_waldump.
>
> It is WAL trouble.
> WAL does not contain the checkpoint from before the crash.
>
> > I thought it might be WAL trouble and tried to check with pg_controldata
> and pg_waldump.
> >
> > ---
> > $ pg_controldata <PostgreSQL data dir>
> > [...]
> > Latest checkpoint location: 15/8FB002C0
> > Latest checkpoint's REDO location: 15/8FAF3018
> > Latest checkpoint's REDO WAL file: 00000001000000150000008F
> > [...]
> >
> > I checked the latest checkpoint's REDO WAL file.
> >
> > ---
> > $ pg_waldump -n 10 pg_wal/00000001000000150000008F
> > pg_waldump: fatal: WAL segment size must be a power of two between 1 MB
> and 1 GB, but the WAL file "00000001000000150000008F" header specifies 0
> bytes
> > ---
> >
> > I cannot read wal data.
> > This file size is 16MB (according to the wal size setting).
> > But the content is all zero data. I checked this situation with "od"
> command.
> >
> > ---
> > $ od -N 40 -A d -v pg_wal/00000001000000150000008F
> > 0000000 000000 000000 000000 000000 000000 000000 000000 000000
> > 0000016 000000 000000 000000 000000 000000 000000 000000 000000
> > 0000032 000000 000000 000000 000000
> > 0000040
> > ---
>
> Looks like modifications to this file were lost.
>
> > 2) Run pg_resetwal and successful start Pod
> >
> > I ran the pg_resetwal command to repair the WAL trouble.
> > And the DB pod successfully started.
>
> Yes, but "pg_resetwal" on a crashed cluster leads to data corruption.
> The best you can do now is salvage what you can.
>
> > But, I received select query fail for some tables.
> >
> > 3) failed select query
> >
> > ---
> > sampledb1=# select * from table1 limit 1;
> > ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619
> > ---
> >
> > I thought taht this error is raised for pg_statistic table (and related
> toast table (pg_toast_2619)).
> > So, I deleted the broken rows in pg_statistic table and ran ANALYZE
> query.
> > After then, this tables is recovered.
>
> Lucky you!
>
> > 4) another data trouble
> >
> > I had an another data trouble after pg_resetwal.
> >
> > In some table, SELECT query did not return any rows.
> > And INSERT query failed (no response. Waiting forever...)
> > This table have only primary key index.
> > I thought it might have been caused by an pkey index trouble after
> pg_resetwal.
> > I didn't know how to repair the index of the primary key, and finally I
> dropped table and restore.
>
> You should "pg_dumpall" the cluster and load it into a cluster
> that was newly created with "initdb".
>
> Any problems loading the data have to be resolved manually.
>
> > I thought this wal trouble was caused by disk IO troubles. But any error
> was not raised in OS syslog.
> > I want to know any other causes.
>
> One cause might be unreliable storage that doesn't honor fsync requests
> correctly. But given that your checkpoint location is pretty late in
> the file, it seems unlikely that none of the data did make it to disk.
>
> Somehow you lost the data for a WAL segment, and you cannot recover.
>
> Of course, if you have a backup, you know what to do.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Joe Conway 2021-03-09 13:26:49 Re: Question about when PostgreSQL 11.0 was released
Previous Message Greg Sabino Mullane 2021-03-09 13:20:04 Re: Question about when PostgreSQL 11.0 was released