Quick Links

Re: Trying to handle db corruption 9.6

From:	Mariel Cherkassky <mariel(dot)cherkassky(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Bimal <internetuser2008(at)yahoo(dot)com>, Greg Clough <Greg(dot)Clough(at)ihsmarkit(dot)com>, PostgreSQL mailing lists <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Trying to handle db corruption 9.6
Date:	2019-05-21 13:12:57
Message-ID:	CA+t6e1m61Q_0+WwJLc5fDc0z4YwcHamEFTN2oRY+fqS2pxEUyQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-admin pgsql-performance

Tomas - Well, when I run the restore_command manually it works (archive
dir exists on the secondary..). Thank for the explanation on the system
catalogs..

Greg - My restore command copy the wals from archive dir in the primary to
an archive dir in the secondary(different from the pg_xlog in the
secondary). Should I run it manually and see if the archives are copied to
the archive dir in the secondary or should I just copy all of them to the
xlog dir in the secondary ?
I tried to start the secondary as a primary (I have a backup..) but I still
got an error (invalid checkpoint record from primary./ secondary). Does it
means that my backup is corrupted ?

‫בתאריך יום ג׳, 21 במאי 2019 ב-16:07 מאת ‪Tomas Vondra‬‏ <‪
tomas(dot)vondra(at)2ndquadrant(dot)com‬‏>:‬

> On Tue, May 21, 2019 at 12:01:31PM +0300, Mariel Cherkassky wrote:
> >Tomas :
> >
> >Well, when you say it does not work, why do you think so? Does it print
> >some error, or what? Does it even get executed? It does not seem to be
> >the case, judging by the log (there's no archive_command message).
> >
> >How was the "secondary machine" created? You said you have all the WAL
> >since then - how do you know that?
> >
> >Well, when I start the secondary in recovery mode (the primary is down,
> >auto failover is disabled..) it doesnt start recovering the archive wals
> >from the primary. The logs of the secondary :
> >receiving incremental file list
> >rsync: link_stat "/var/lib/pgsql/archive/00000002.history" failed: No such
> >file or directory (2)
> >
> >sent 8 bytes received 10 bytes 36.00 bytes/sec
> >total size is 0 speedup is 0.00
> >rsync error: some files/attrs were not transferred (see previous errors)
> >(code 23) at main.c(1505) [receiver=3.0.6]
> >sh: /var/lib/pgsql/archive/00000002.history: No such file or directory
> >2019-05-20 09:41:33 PDT 18558 LOG: entering standby mode
> >2019-05-20 09:41:33 PDT 18558 LOG: invalid primary checkpoint record
> >2019-05-20 09:41:33 PDT 18558 LOG: invalid secondary checkpoint link in
> >control file
> >2019-05-20 09:41:33 PDT 18558 PANIC: could not locate a valid
> checkpoint
> >record
> >2019-05-20 09:41:33 PDT 18555 LOG: startup process (PID 18558) was
> >terminated by signal 6: Aborted
> >2019-05-20 09:41:33 PDT 18555 LOG: aborting startup due to startup
> >process failure
> >2019-05-20 09:41:33 PDT 18555 LOG: database system is shut down
> >2019-05-20 09:56:12 PDT 18701 LOG: database system was shut down in
> >recovery at 2019-05-01 09:40:02 PDT
> >
> >As I said, the secondary was down for a month and I have all the archives
> >of the wals in my primary. I was hoping that the secondary will use the
> >restore_command to restore them :
> >restore_command = 'rsync -avzhe ssh postgres(at)X(dot)X(dot)X(dot)X
> :/var/lib/pgsql/archive/%f
> >/var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
> >
> >my archive_command on the primary was :
> >archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'
> >
> >Am I missing something ?
> >
>
> First of all, the way you quote message is damn confusing - there's no
> clear difference between your message and the message you quote. I don't
> know which mail client you're using, but I suppose it can be configured to
> quote sensibly ...
>
> Well, clearly the standby tries to fetch WAL from archive, but the rsync
> command fails for some reason. You're in the position to investigate
> further, because you can run it manually - we can't. This has nothing to
> do with PostgreSQL. My guess is you don't have /var/lib/pgsql/archive on
> the standby, and it's confusing because archive uses the same path.
>
>
> >Another question, If I'll run initdb and initiate a new cluster and i'll
> >copy the data files of my old cluster into the new one, is there any
> chance
> >that it will work ?
> >I mean right now, my primary is down and cant start up because it is
> >missing an offset file in the pg_multixtrans/offset dir.
> >
>
> No, because you won't have contents of system catalogs, mapping the data
> files to relations (tables, indexes) and containing information about the
> structure (which columns / data types are in the data).
>
> The data files are pretty useless on their own. It might be possible to do
> some manualy recovery - say, you might create the same tables in the new
> schema, and then guess which data files belong to them. But there are
> various caveats e.g. due to dropped columns, etc.
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>

In response to

Re: Trying to handle db corruption 9.6 at 2019-05-21 13:07:14 from Tomas Vondra

Responses

Re: Trying to handle db corruption 9.6 at 2019-05-21 16:37:27 from Tomas Vondra

Browse pgsql-admin by date

	From	Date	Subject
Next Message	Tomas Vondra	2019-05-21 16:37:27	Re: Trying to handle db corruption 9.6
Previous Message	Tomas Vondra	2019-05-21 13:07:14	Re: Trying to handle db corruption 9.6

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Tomas Vondra	2019-05-21 16:37:27	Re: Trying to handle db corruption 9.6
Previous Message	Tomas Vondra	2019-05-21 13:07:14	Re: Trying to handle db corruption 9.6