Re: root cause of corruption in hot standby

From: Mike Broers <mbroers(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: root cause of corruption in hot standby
Date: 2018-09-20 15:13:03
Message-ID: CAB9893hmmv-0o=yBLmKmSUHxX0g7nX5iSV5iG1vLBVj8W8EwJw@mail.gmail.com
Lists: pgsql-admin

I ran fsck and re-ran mkfs.ext4 in case something was wrong with the
underlying filesystem, then reran pg_basebackup and got the same problem.
It seems to be choking on an archived WAL segment that was rsync'ed during
a nightly vacuum.

2018-09-20 09:48:22 CDT [21017]: [5305-1] user=,db= LOG: restored log file
"0000000100007D260000000E" from archive
2018-09-20 09:48:22 CDT [21017]: [5306-1] user=,db= LOG: incorrect
resource manager data checksum in record at 7D26/EC04550
2018-09-20 09:48:22 CDT [28296]: [1-1] user=,db= LOG: started streaming
WAL from primary at 7D26/E000000 on timeline 1
2018-09-20 09:48:22 CDT [28296]: [2-1] user=,db= FATAL: could not receive
data from WAL stream: ERROR: requested WAL segment
0000000100007D260000000E has already been removed

I don't know if this is a good test, but I attempted to diff the file from
the primary against the file that got rsync'ed, and they 'differ':

-bash-4.2$ diff 0000000100007D260000000E 0000000100007D260000000E.sch1-db8

Binary files 0000000100007D260000000E and 0000000100007D260000000E.sch1-db8
differ
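
To pinpoint where the copies diverge rather than just that they do, cmp can
report the differing byte offsets; a minimal sketch against the same two
files as above:

# print the offset of the first mismatching byte
cmp 0000000100007D260000000E 0000000100007D260000000E.sch1-db8

# or list every mismatch: byte offset, then the two byte values in octal
cmp -l 0000000100007D260000000E 0000000100007D260000000E.sch1-db8 | head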

It seems like my archived WAL rsync strategy is flawed? I'm just running a
cron job to rsync from the primary server to the replica every 15 minutes:

rsync -az -B 16384 --rsh=ssh ${PGPRIMARYHOST}/archive ${PGBACKUP}/archive
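
If the flaw is rsync reading a segment while the primary's archive_command
is still writing it, one common mitigation (a sketch, not tested here; the
archive path is illustrative) is to archive to a temporary name and then
rename, since rename is atomic, and have the cron job skip the temp files:

# on the primary, in postgresql.conf: copy to a temp name, then rename
# (path is illustrative; adjust to the real archive directory)
archive_command = 'test ! -f /path/to/archive/%f && cp %p /path/to/archive/%f.tmp && mv /path/to/archive/%f.tmp /path/to/archive/%f'

# in the cron job: never pick up half-written temp files
rsync -az -B 16384 --exclude='*.tmp' --rsh=ssh ${PGPRIMARYHOST}/archive ${PGBACKUP}/archive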

Although it's strange to me: if rsync is copying the file before the writes
complete, why are the two files the same size?

-bash-4.2$ ls -l 0000000100007D260000000E
-rw------- 1 postgres postgres 16777216 Sep 20 03:57
0000000100007D260000000E

-bash-4.2$ ls -l 0000000100007D260000000E.sch1-db8
-rw------- 1 postgres postgres 16777216 Sep 20 10:05
0000000100007D260000000E.sch1-db8

I will read the docs on the recommended way to copy WALs and see if I am
missing something obvious, but any help is always appreciated :)
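
In the meantime, one way I might sanity-check an archived segment before
trusting it is pg_xlogdump, which ships with 9.5; a sketch, assuming the
segment file is in the current directory:

# walk every record in the segment; a corrupt copy should fail with an
# error at the bad record instead of reading through to the end of WAL
pg_xlogdump 0000000100007D260000000E > /dev/null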

Mike

On Wed, Sep 19, 2018 at 9:14 AM Mike Broers <mbroers(at)gmail(dot)com> wrote:

> A fresh replica using pg_basebackup on the same system generated similar
> errors:
>
> cp: cannot stat ‘/mnt/backup/pgsql/9.5/archive/00000002.history’: No such
> file or directory
> 2018-09-19 08:36:23 CDT [57006]: [179-1] user=,db= LOG: restored log file
> "0000000100007CFC00000040" from archive
> 2018-09-19 08:36:23 CDT [57006]: [180-1] user=,db= LOG: incorrect
> resource manager data checksum in record at 7CFC/405ED198
>
> I am going to run filesystem checks now. Maybe the backup volume the
> archived WALs get rsync'ed to has problems and is corrupting them before
> replay? No checksum alerts on the primary or on an additional replica.
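>
> One check I could script against that theory (an illustrative invocation,
> reusing the cron job's paths): an rsync dry run with full checksums, which
> lists any archived segment whose content no longer matches the primary:
>
> # -n = dry run, -c = compare by checksum; paths assumed from the cron job
> rsync -anc --rsh=ssh ${PGPRIMARYHOST}/archive/ ${PGBACKUP}/archive/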
>
> If I can supply additional info that would help get some advice please let
> me know.
>
> Postgres 9.5.14, CentOS 7, ext4 filesystem, hyper-v VM
>
> On Mon, Sep 10, 2018 at 8:40 AM Mike Broers <mbroers(at)gmail(dot)com> wrote:
>
>> Well, I've verified my primary backups are working, and I think my plan
>> is to patch to 9.5.14, re-prime a replica in the same environment, and
>> see how it goes, unless someone has an idea of something to check on the
>> host to avoid future corruption...
>>
>> On Thu, Sep 6, 2018 at 12:28 PM Mike Broers <mbroers(at)gmail(dot)com> wrote:
>>
>>> So I have discovered corruption in a postgres 9.5.12 read replica, yay
>>> checksums!
>>>
>>> 2018-09-06 12:00:53 CDT [1563]: [4-1] user=postgres,db=production
>>> WARNING: page verification failed, calculated checksum 3482 but expected
>>> 32232
>>>
>>> 2018-09-06 12:00:53 CDT [1563]: [5-1] user=postgres,db=production
>>> ERROR: invalid page in block 15962 of relation base/16384/464832386
>>>
>>> The rest of the log is clean and just has the usual monitoring queries,
>>> as this isn't a heavily used db.
>>>
>>> This corruption isn't occurring on the primary or on a second replica,
>>> so I'm not freaking out exactly, but I'm not sure how I can further
>>> diagnose what the root cause of the corruption might be.
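>>>
>>> One thing I may try is dumping the reported block to eyeball the damage
>>> (a sketch, assuming the default 8 kB block size; block 15962 falls
>>> within the relation's first 1 GB segment file):
>>>
>>> # assumes 8 kB blocks; point $PGDATA at this cluster's data directory
>>> dd if=$PGDATA/base/16384/464832386 bs=8192 skip=15962 count=1 2>/dev/null | od -A d -t x1 | less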
>>>
>>> There were no power outages. This is a streaming hot standby replica
>>> that looks like it was connected fine to its primary's xlog at the time,
>>> and not falling back on rsync'ed WALs or anything. We run off an SSD SAN
>>> that is allocated using LVM, and I've seen documentation stating that
>>> can be problematic, but I'm unclear on how to diagnose what might have
>>> been the root cause, and now I'm somewhat uncomfortable with this
>>> environment's reliability in general.
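>>>
>>> As a first pass at the storage question, I could grep the kernel log
>>> for I/O or ext4 complaints; a sketch (the grep patterns are just
>>> illustrative):
>>>
>>> # look for recent block-layer, ext4, or LVM errors in the kernel log
>>> dmesg | grep -iE 'ext4|i/o error|lvm' | tail -50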
>>>
>>> Does anyone have advice on what to check further to determine a
>>> possible root cause? This is a CentOS 7 VM running on Hyper-V.
>>>
>>> Thanks for any assistance, greatly appreciated!
>>> Mike
>>>
>>>
