Re: root cause of corruption in hot standby

From: Mike Broers <mbroers(at)gmail(dot)com>
To: rui(at)crazybean(dot)net, pgsql-admin(at)postgresql(dot)org
Subject: Re: root cause of corruption in hot standby
Date: 2018-10-09 17:04:12
Message-ID: CAB9893ifAVmNGqsdyxPK_Ca_rgr6bwooEebHLvNNb0-ASBECsg@mail.gmail.com
Lists: pgsql-admin

OK, so I have checksum errors in this replica AGAIN.

To remind the list of the scenario:
- Primary: 9.5.14 on CentOS 7.4, ext4 filesystem, on VMware in a colo
  hosting environment with an SSD SAN.
- One replica in the same colo, with the same OS configuration, on a Hyper-V
  VM on the same SAN; this is the replica that starts reporting checksum
  errors.
- A second replica with the same OS configuration in AWS does not have this
  problem.

Both replicas are connected directly to the primary as hot standbys using
streaming replication, with a restore command and rsync'ed archive logs as a
fallback if streaming falls behind the primary's xlog.
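
For reference, each standby's recovery.conf is along these lines (connection
details are placeholders):

standby_mode = 'on'
primary_conninfo = 'host=primary-host user=replication_user'
restore_command = 'cp /mnt/backup/pgsql/9.5/archive/%f %p'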

This occurred during a time of relatively low activity on the primary. I
verified in the log that streaming replication was connected and that the
error isn't happening during the application of an rsync'ed transaction log.
Previously it was thought that maybe I was rsyncing transaction logs while
they were still being written and that this was the cause. I'm doubting that
now, and I've changed the rsync command to base its file list on a find with
-mmin +10, so files aren't rsynced unless their mtime is more than 10
minutes old (rough sketch below).
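
The cron job now looks roughly like this (host and directory names are
placeholders for my environment; untested sketch):

#!/bin/bash
# Runs from cron on the replica. Build the file list on the primary with
# find -mmin +10, so segments modified in the last 10 minutes are skipped,
# then let rsync copy only those files.
ssh postgres@primary-host \
  "cd /mnt/backup/pgsql/9.5/archive && find . -maxdepth 1 -type f -mmin +10 -print0" \
  | rsync -az --from0 --files-from=- \
      postgres@primary-host:/mnt/backup/pgsql/9.5/archive/ "${PGBACKUP}/archive/"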

The last time this happened I ran fsck (no errors) and even reformatted the
volumes. Is there any extensive testing I can do to verify the memory isn't
the problem, or some other tests to try to determine root cause? It's a
Hyper-V VM, so I'm not sure how the virtual hardware abstraction is playing
a role. My next step is going to be destroying and recreating the VM, but
I'm going to have a hard time being confident in the replica environment if
I have no idea what is causing this.

Many other VMs share the Hyper-V environment; no other hosts have reported
this kind of problem.

Thanks for reading, any guidance is appreciated.

On Thu, Sep 20, 2018 at 3:17 PM Mike Broers <mbroers(at)gmail(dot)com> wrote:

> Strange to finally hit the race condition after using this approach for so
> long on so many different databases, but I can accept that my understanding
> was flawed. I'm trying to decide on a path to fix this going forward.
>
> If I switch to using a script for archiving, with both the local copy and
> the rsync in the script, then I'm assuming that would be the best way, as
> Postgres only invokes the archive command AFTER the file has been fully
> written?
>
> Something akin to:
> archive_command = 'cp "%p" /mnt/backup/pgsql/9.5/archive/"%f" && rsync -a
> "%p" postgres@destination_host:/mnt/backup/pgsql/archive/"%f"'
>
> I don't really like the idea of archive_command needing an rsync to succeed
> in order to rotate through pg_xlog, but perhaps I can wire that into a bash
> script that exits with 0 and notifies me in some other way if the rsync
> fails (rough sketch below).
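>
> Something like this, maybe (untested; the script path and mail alert are
> illustrative):
>
> #!/bin/bash
> # hypothetical archive_command = '/usr/local/bin/archive_wal.sh %p %f'
> set -u
> p="$1"; f="$2"
>
> # The local archive copy must succeed, or Postgres retries the segment.
> cp "$p" "/mnt/backup/pgsql/9.5/archive/$f" || exit 1
>
> # The remote copy is best-effort: alert on failure but still exit 0,
> # so pg_xlog can keep rotating.
> if ! rsync -a "$p" "postgres@destination_host:/mnt/backup/pgsql/archive/$f"; then
>     echo "WAL rsync failed for $f" | mail -s "WAL archive rsync failure" me@example.com
> fi
> exit 0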
>
> Thanks for commenting Rui!
>
> On Thu, Sep 20, 2018 at 2:00 PM Rui DeSousa <rui(at)crazybean(dot)net> wrote:
>
>> Hi Mike,
>>
>> Using rsync on an active file will lead to a “corrupted” copy, which is
>> expected; however, there is also a race condition between rsync and the
>> filesystem metadata: if a file changes without its size or mtime changing,
>> subsequent rsync runs may skip it when it really should be re-copied. That
>> is what the -c (“checksum”) option in rsync is for; it checksums the files
>> to determine whether they need to be rsync'd instead of relying only on
>> filesystem metadata (example below).
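>>
>> For example, your existing cron command with -c added:
>>
>> rsync -azc -B 16384 --rsh=ssh ${PGPRIMARYHOST}/archive ${PGBACKUP}/archive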
>>
>> Using the -c option should fix your issue; however, I don't think you
>> really want to use it, as that would cause a lot of unnecessary I/O to
>> compare all the already-archived files. A better approach would be to
>> archive the file under a temporary name and, once it is fully written,
>> rename it to the final name (sketch below).
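>>
>> Roughly, in the archive script (names illustrative):
>>
>> # Write under a temporary name, then rename. rename() is atomic on the
>> # same filesystem, so a reader never sees a half-written file.
>> cp "$p" "$ARCHIVE_DIR/$f.tmp" && mv "$ARCHIVE_DIR/$f.tmp" "$ARCHIVE_DIR/$f"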
>>
>> What are you using for archiving, a script or just copy?
>>
>>
>>
>> On Sep 20, 2018, at 11:13 AM, Mike Broers <mbroers(at)gmail(dot)com> wrote:
>>
>> I ran fsck and reran mkfs.ext4 in case there was something wrong with the
>> underlying filesystems, then reran pg_basebackup and got the same problem.
>> It seems to be choking on an archived WAL that was rsync'ed during the
>> nightly vacuum.
>>
>> 2018-09-20 09:48:22 CDT [21017]: [5305-1] user=,db= LOG: restored log
>> file "0000000100007D260000000E" from archive
>> 2018-09-20 09:48:22 CDT [21017]: [5306-1] user=,db= LOG: incorrect
>> resource manager data checksum in record at 7D26/EC04550
>> 2018-09-20 09:48:22 CDT [28296]: [1-1] user=,db= LOG: started streaming
>> WAL from primary at 7D26/E000000 on timeline 1
>> 2018-09-20 09:48:22 CDT [28296]: [2-1] user=,db= FATAL: could not
>> receive data from WAL stream: ERROR: requested WAL segment
>> 0000000100007D260000000E has already been removed
>>
>> I don't know if this is a good test, but I diffed the file from the
>> primary against the file that got rsync'ed, and they 'differ':
>>
>> -bash-4.2$ diff 0000000100007D260000000E 0000000100007D260000000E.sch1-db8
>>
>> Binary files 0000000100007D260000000E and
>> 0000000100007D260000000E.sch1-db8 differ
>>
>> It seems like my archived WAL rsync strategy is flawed? I'm just running a
>> cron job to rsync from the primary server to the replica every 15 minutes:
>>
>> rsync -az -B 16384 --rsh=ssh ${PGPRIMARYHOST}/archive ${PGBACKUP}/archive
>>
>> Although it's strange to me: if it's rsyncing a file before writes
>> complete, why are they the same size?
>>
>> -rw------- 1 postgres postgres 16777216 Sep 20 03:57
>> 0000000100007D260000000E
>> -bash-4.2$ ls -l 0000000100007D260000000E.sch1-db8
>> -rw------- 1 postgres postgres 16777216 Sep 20 10:05
>> 0000000100007D260000000E.sch1-db8
>>
>> I will read the docs on the advised way to copy WALs and see if I am
>> missing something obvious, but any help is always appreciated :)
>>
>> Mike
>>
>>
>>
>> On Wed, Sep 19, 2018 at 9:14 AM Mike Broers <mbroers(at)gmail(dot)com> wrote:
>>
>>> A fresh replica built with pg_basebackup on the same system generated
>>> similar errors:
>>>
>>> cp: cannot stat ‘/mnt/backup/pgsql/9.5/archive/00000002.history’: No
>>> such file or directory
>>> 2018-09-19 08:36:23 CDT [57006]: [179-1] user=,db= LOG: restored log
>>> file "0000000100007CFC00000040" from archive
>>> 2018-09-19 08:36:23 CDT [57006]: [180-1] user=,db= LOG: incorrect
>>> resource manager data checksum in record at 7CFC/405ED198
>>>
>>> I am going to run filesystem checks now. Maybe the backup volume the
>>> archived WALs get rsync'ed to has problems and is corrupting them before
>>> replay? There are no checksum alerts on the primary or on an additional
>>> replica.
>>>
>>> If I can supply additional info that would help get some advice, please
>>> let me know.
>>>
>>> Postgres 9.5.14, CentOS 7, ext4 filesystem, Hyper-V VM
>>>
>>> On Mon, Sep 10, 2018 at 8:40 AM Mike Broers <mbroers(at)gmail(dot)com> wrote:
>>>
>>>> Well, I've verified my primary backups are working, and I think my plan
>>>> is to patch to 9.5.14, re-prime a replica in the same environment, and
>>>> see how it goes, unless someone has an idea of something to check on the
>>>> host to avoid future corruption...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Sep 6, 2018 at 12:28 PM Mike Broers <mbroers(at)gmail(dot)com> wrote:
>>>>
>>>>> So I have discovered corruption in a postgres 9.5.12 read replica, yay
>>>>> checksums!
>>>>>
>>>>> 2018-09-06 12:00:53 CDT [1563]: [4-1] user=postgres,db=production
>>>>> WARNING: page verification failed, calculated checksum 3482 but expected
>>>>> 32232
>>>>>
>>>>> 2018-09-06 12:00:53 CDT [1563]: [5-1] user=postgres,db=production
>>>>> ERROR: invalid page in block 15962 of relation base/16384/464832386
>>>>>
>>>>> The rest of the log is clean and just has the usual monitoring queries,
>>>>> as this isn't a heavily used db.
>>>>>
>>>>> This corruption isn't occurring on the primary or on a second replica,
>>>>> so I'm not freaking out exactly, but I'm not sure how I can further
>>>>> diagnose what the root cause of the corruption might be.
>>>>>
>>>>> There were no power outages. This is a streaming hot standby replica
>>>>> that looks like it was connected fine to its primary's xlog at the
>>>>> time, and not falling back on rsync'ed WALs or anything. We run off an
>>>>> SSD SAN that is allocated using LVM, and I've seen documentation
>>>>> stating that can be problematic, but I'm unclear on how to diagnose
>>>>> what the root cause might have been, and now I'm somewhat uncomfortable
>>>>> with this environment's reliability in general.
>>>>>
>>>>> Does anyone have advice on what to check further to determine a
>>>>> possible root cause? This is a CentOS 7 VM running on Hyper-V.
>>>>>
>>>>> Thanks for any assistance, greatly appreciated!
>>>>> Mike
>>>>>
>>>>>
>>
