AW: Re: PostgreSQL container crash trouble.

From: "Zwettler Markus (OIZ)" <Markus(dot)Zwettler(at)zuerich(dot)ch>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Cc: Daisuke Ikeda <dai(dot)ikd123(at)gmail(dot)com>
Subject: AW: Re: PostgreSQL container crash trouble.
Date: 2021-03-11 10:52:38
Message-ID: 07b9abcdad504a179325ec6431327d8e@zuerich.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Laurenz,

You said use your backup.

Wouldn't it be enough to start instance crash recovery just before the corrupted wal in that case?

recovery_target_lsn = <<lsn for 00000001000000150000008F>>
recovery_target_inclusive = off

No need for a backup. Correct?

Cheers, Markus

> -----Ursprüngliche Nachricht-----
> Von: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
> Gesendet: Montag, 8. März 2021 16:28
> An: Daisuke Ikeda <dai(dot)ikd123(at)gmail(dot)com>; pgsql-general(at)lists(dot)postgresql(dot)org
> Betreff: [Extern] Re: PostgreSQL container crash trouble.
>
> On Mon, 2021-03-08 at 22:30 +0900, Daisuke Ikeda wrote:
> > I'm running PostgreSQL (enabled timescaledb extension) in the following
> environment.
> > And I had db pod crashed error several times irregularly.
> >
> > I want to know the cause of this fail.
> >
> > Environment:
> >
> > - Base: Kubernetes(microk8s)
> > - DB: PostgreSQL 12.4 (TimescaleDB) Container
> > - DB OS: Alpine Linux
> > - microk8s host OS: CentOS 7.6, Amazon Linux 2 (Occured under some hosts)
> > - DB data location: Mounted host directory (for data persistence)
> >
> > 1) PostgreSQL crash and cannot start Pod..
> >
> > I noticed that the timescaledb pod was restarted repeatedly with the following
> error.
> >
> > ---
> > PostgreSQL Database directory appears to contain a database; Skipping
> > initialization [1] LOG: starting PostgreSQL 12.4 on
> > x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit [1]
> > LOG: listening on IPv4 address "0.0.0.0", port 5432 [1] LOG:
> > listening on IPv6 address "::", port 5432 [1] LOG: listening on Unix socket
> "/var/run/postgresql/.s.PGSQL.5432"
> > [20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC
> > [20] LOG: invalid primary checkpoint record [20] PANIC: could not
> > locate a valid checkpoint record [1] LOG: startup process (PID 20)
> > was terminated by signal 6: Aborted [1] LOG: aborting startup due to
> > startup process failure [1] LOG: database system is shut down
> > ---
> >
> > I thought it might be WAL trouble and tried to check with pg_controldata and
> pg_waldump.
>
> It is WAL trouble.
> WAL does not contain the checkpoint from before the crash.
>
> > I thought it might be WAL trouble and tried to check with pg_controldata and
> pg_waldump.
> >
> > ---
> > $ pg_controldata <PostgreSQL data dir> [...]
> > Latest checkpoint location: 15/8FB002C0
> > Latest checkpoint's REDO location: 15/8FAF3018
> > Latest checkpoint's REDO WAL file: 00000001000000150000008F
> > [...]
> >
> > I checked the latest checkpoint's REDO WAL file.
> >
> > ---
> > $ pg_waldump -n 10 pg_wal/00000001000000150000008F
> > pg_waldump: fatal: WAL segment size must be a power of two between 1
> > MB and 1 GB, but the WAL file "00000001000000150000008F" header
> > specifies 0 bytes
> > ---
> >
> > I cannot read wal data.
> > This file size is 16MB (according to the wal size setting).
> > But the content is all zero data. I checked this situation with "od" command.
> >
> > ---
> > $ od -N 40 -A d -v pg_wal/00000001000000150000008F
> > 0000000 000000 000000 000000 000000 000000 000000 000000 000000
> > 0000016 000000 000000 000000 000000 000000 000000 000000 000000
> > 0000032 000000 000000 000000 000000
> > 0000040
> > ---
>
> Looks like modifications to this file were lost.
>
> > 2) Run pg_resetwal and successful start Pod
> >
> > I ran the pg_resetwal command to repair the WAL trouble.
> > And the DB pod successfully started.
>
> Yes, but "pg_resetwal" on a crashed cluster leads to data corruption.
> The best you can do now is salvage what you can.
>
> > But, I received select query fail for some tables.
> >
> > 3) failed select query
> >
> > ---
> > sampledb1=# select * from table1 limit 1;
> > ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619
> > ---
> >
> > I thought taht this error is raised for pg_statistic table (and related toast table
> (pg_toast_2619)).
> > So, I deleted the broken rows in pg_statistic table and ran ANALYZE query.
> > After then, this tables is recovered.
>
> Lucky you!
>
> > 4) another data trouble
> >
> > I had an another data trouble after pg_resetwal.
> >
> > In some table, SELECT query did not return any rows.
> > And INSERT query failed (no response. Waiting forever...) This table
> > have only primary key index.
> > I thought it might have been caused by an pkey index trouble after pg_resetwal.
> > I didn't know how to repair the index of the primary key, and finally I dropped
> table and restore.
>
> You should "pg_dumpall" the cluster and load it into a cluster that was newly
> created with "initdb".
>
> Any problems loading the data have to be resolved manually.
>
> > I thought this wal trouble was caused by disk IO troubles. But any error was not
> raised in OS syslog.
> > I want to know any other causes.
>
> One cause might be unreliable storage that doesn't honor fsync requests correctly.
> But given that your checkpoint location is pretty late in the file, it seems unlikely
> that none of the data did make it to disk.
>
> Somehow you lost the data for a WAL segment, and you cannot recover.
>
> Of course, if you have a backup, you know what to do.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
>
>
>
>
> Achtung: Diese E-Mail wurde von einer externen Adresse verschickt. Klicken Sie
> auf keine Links und öffnen Sie keine angehängten Dateien, wenn Sie den Absender
> bzw. die Absenderin nicht kennen. Sind Sie sich unsicher, kontaktieren Sie den
> Service Desk der Stadt Zürich.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Laurenz Albe 2021-03-11 13:22:55 Re: AW: Re: PostgreSQL container crash trouble.
Previous Message Valeria Kaplan 2021-03-11 10:03:56 Re: Code of Conduct Russian Translation - Revised March 5, 2021