Re: Trying to handle db corruption 9.6

From: Mariel Cherkassky <mariel(dot)cherkassky(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Bimal <internetuser2008(at)yahoo(dot)com>, Greg Clough <Greg(dot)Clough(at)ihsmarkit(dot)com>, PostgreSQL mailing lists <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Trying to handle db corruption 9.6
Date: 2019-05-21 09:01:31
Message-ID: CA+t6e1k0PdBd9uwTD+BwqwfN4UNtVZht2FSEvpjS_m4Jog7xhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-performance

Tomas :

Well, when you say it does not work, why do you think so? Does it print
some error, or what? Does it even get executed? It does not seem to be
the case, judging by the log (there's no archive_command message).

How was the "secondary machine" created? You said you have all the WAL
since then - how do you know that?

Well, when I start the secondary in recovery mode (the primary is down,
auto failover is disabled..) it doesnt start recovering the archive wals
from the primary. The logs of the secondary :
receiving incremental file list
rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
file or directory (2)

sent 8 bytes received 10 bytes 36.00 bytes/sec
total size is 0 speedup is 0.00
rsync error: some files/attrs were not transferred (see previous errors)
(code 23) at main.c(1505) [receiver=3.0.6]
sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
2019-05-20 09:41:33 PDT 18558 LOG: entering standby mode
2019-05-20 09:41:33 PDT 18558 LOG: invalid primary checkpoint record
2019-05-20 09:41:33 PDT 18558 LOG: invalid secondary checkpoint link in
control file
2019-05-20 09:41:33 PDT 18558 PANIC: could not locate a valid checkpoint
record
2019-05-20 09:41:33 PDT 18555 LOG: startup process (PID 18558) was
terminated by signal 6: Aborted
2019-05-20 09:41:33 PDT 18555 LOG: aborting startup due to startup
process failure
2019-05-20 09:41:33 PDT 18555 LOG: database system is shut down
2019-05-20 09:56:12 PDT 18701 LOG: database system was shut down in
recovery at 2019-05-01 09:40:02 PDT

As I said, the secondary was down for a month and I have all the archives
of the wals in my primary. I was hoping that the secondary will use the
restore_command to restore them :
restore_command = 'rsync -avzhe ssh postgres(at)X(dot)X(dot)X(dot)X:/var/lib/pgsql/archive/%f
/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'

my archive_command on the primary was :
archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'

Am I missing something ?

Another question, If I'll run initdb and initiate a new cluster and i'll
copy the data files of my old cluster into the new one, is there any chance
that it will work ?
I mean right now, my primary is down and cant start up because it is
missing an offset file in the pg_multixtrans/offset dir.

‫בתאריך יום ג׳, 21 במאי 2019 ב-0:04 מאת ‪Tomas Vondra‬‏ <‪
tomas(dot)vondra(at)2ndquadrant(dot)com‬‏>:‬

> On Mon, May 20, 2019 at 08:20:33PM +0300, Mariel Cherkassky wrote:
> > Hey Greg,
> > Basically my backup was made after the first pg_resetxlog so I was
> wrong.
>
> Bummer.
>
> > However, the customer had a secondary machine that wasn't synced for a
> > month. I have all the walls since the moment the secondary went out of
> > sync. Once I started it I hoped that it will start recover the wals and
> > fill the gap. However I got an error in the secondary :
> > 2019-05-20 10:11:28 PDT 19021 LOG: entering standby mode
> > 2019-05-20 10:11:28 PDT 19021 LOG: invalid primary checkpoint record
> > 2019-05-20 10:11:28 PDT 19021 LOG: invalid secondary checkpoint
> link in
> > control file
> > 2019-05-20 10:11:28 PDT 19021 PANIC: could not locate a valid
> > checkpoint record
> > 2019-05-20 10:11:28 PDT 19018 LOG: startup process (PID 19021) was
> > terminated by signal 6: Aborted
> > 2019-05-20 10:11:28 PDT 19018 LOG: aborting startup due to startup
> > process failure
> > 2019-05-20 10:11:28 PDT 19018 LOG: database system is shut down.
>
> > I checked my secondary archive dir and pg_xlog dir
> and
> > it seems that the restore command doesnt work. My restore_command:
>
> > restore_command = 'rsync -avzhe ssh
> > postgres(at)x(dot)x(dot)x(dot)x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
> > gunzip < /var/lib/pgsql/archive/%f > %p'
> > archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
> > /var/lib/pgsql/archive %r'
>
> Well, when you say it does not work, why do you think so? Does it print
> some error, or what? Does it even get executed? It does not seem to be
> the case, judging by the log (there's no archive_command message).
>
> How was the "secondary machine" created? You said you have all the WAL
> since then - how do you know that?
>
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Tomas Vondra 2019-05-21 13:07:14 Re: Trying to handle db corruption 9.6
Previous Message Tomas Vondra 2019-05-20 21:04:06 Re: Trying to handle db corruption 9.6

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2019-05-21 13:07:14 Re: Trying to handle db corruption 9.6
Previous Message David Rowley 2019-05-21 02:15:09 Re: Temporarily very slow planning time after a big delete