Re: root cause of corruption in hot standby

From: Mike Broers <mbroers(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: root cause of corruption in hot standby
Date: 2018-09-19 14:14:33
Message-ID: CAB9893honA2P0pMqZDaLQ5CwAaZntYoKTzZ4r+yy1CxULZEgkg@mail.gmail.com
Lists: pgsql-admin

A fresh replica built with pg_basebackup on the same system generated similar
errors:

cp: cannot stat ‘/mnt/backup/pgsql/9.5/archive/00000002.history’: No such
file or directory
2018-09-19 08:36:23 CDT [57006]: [179-1] user=,db= LOG: restored log file
"0000000100007CFC00000040" from archive
2018-09-19 08:36:23 CDT [57006]: [180-1] user=,db= LOG: incorrect resource
manager data checksum in record at 7CFC/405ED198
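
As a cross-check on the two log lines above, the LSN in the checksum error can be mapped back to a WAL segment file name. This is a minimal Python sketch, assuming the default 16 MB WAL segment size and timeline 1 (as in the restored file name); `wal_file_for_lsn` is a hypothetical helper, not part of PostgreSQL.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def wal_file_for_lsn(lsn, timeline=1, seg_size=WAL_SEGMENT_SIZE):
    """Return (segment file name, byte offset in that segment) for an LSN."""
    hi_text, lo_text = lsn.split("/")
    hi = int(hi_text, 16)          # high 32 bits of the LSN
    lo = int(lo_text, 16)          # low 32 bits of the LSN
    seg = lo // seg_size           # segment number within this high word
    offset = lo % seg_size         # position of the record inside the segment
    name = "%08X%08X%08X" % (timeline, hi, seg)
    return name, offset


name, offset = wal_file_for_lsn("7CFC/405ED198")
print(name, hex(offset))
```

For the LSN in the error this gives 0000000100007CFC00000040, the same segment the log reports as restored just before the failure, with the bad record at byte offset 0x5ED198 in that file.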

I am going to run file system checks now. Maybe the backup volume that the
archived WALs get rsync'ed to has problems and is corrupting the files that
get replayed? There are no checksum alerts on the primary or on an additional
replica.
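
One way to test the theory that the rsync'ed copy is damaged would be to hash the same WAL segment at its source and on the backup volume and compare the digests. This is a minimal sketch; `sha256_of` and `same_content` are illustrative helpers, and the source-side path would have to be filled in for the actual environment.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a 16 MB WAL segment is not read in one call."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical bytes (compared by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If the two copies are on different hosts, `sha256_of` can be run on each side and the hex digests compared by hand. A mismatch for 0000000100007CFC00000040 would point at the copy or the backup volume rather than at the primary.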

If there is additional info I can supply that would help in getting advice,
please let me know.

Postgres 9.5.14, CentOS 7, ext4 filesystem, Hyper-V VM

On Mon, Sep 10, 2018 at 8:40 AM Mike Broers <mbroers(at)gmail(dot)com> wrote:

> Well I've verified my primary backups are working, and think my plan is to
> patch to 9.5.14, reprime a replica in the same environment and see how it
> goes unless someone has an idea of something to check on the host to avoid
> future corruption...
>
> On Thu, Sep 6, 2018 at 12:28 PM Mike Broers <mbroers(at)gmail(dot)com> wrote:
>
>> So I have discovered corruption in a postgres 9.5.12 read replica, yay
>> checksums!
>>
>> 2018-09-06 12:00:53 CDT [1563]: [4-1] user=postgres,db=production
>> WARNING: page verification failed, calculated checksum 3482 but expected
>> 32232
>>
>> 2018-09-06 12:00:53 CDT [1563]: [5-1] user=postgres,db=production ERROR:
>> invalid page in block 15962 of relation base/16384/464832386
>>
>> The rest of the log is clean and just has the usual monitoring queries, as
>> this isn't a heavily used db.
>>
>> This corruption isn't occurring on the primary or a second replica, so I'm
>> not freaking out exactly, but I'm not sure how I can further diagnose what
>> the root cause of the corruption might be.
>>
>> There were no power outages. This is a streaming hot standby replica
>> that looks like it was connected fine to its primary xlog at the time, and
>> not falling back on rsync'ed WALs or anything. We run off an SSD SAN that
>> is allocated using LVM, and I've noticed documentation that states that can
>> be problematic, but I'm unclear on how to diagnose what might have been the
>> root cause, and now I'm somewhat uncomfortable with this environment's
>> reliability in general.
>>
>> Does anyone have advice for what to check further to determine a possible
>> root cause? This is a CentOS 7 VM running on Hyper-V.
>>
>> Thanks for any assistance, greatly appreciated!
>> Mike
>>
>>
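
For the invalid page error in the original message (block 15962 of relation base/16384/464832386), the failing page can be located on disk for inspection or for comparison against the primary. This is a minimal sketch, assuming the default 8 kB block size and 1 GB relation segment files; `locate_block` is a hypothetical helper.

```python
BLOCK_SIZE = 8192                 # assumption: default BLCKSZ
SEGMENT_BYTES = 1024 ** 3         # assumption: default 1 GB relation segments
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLOCK_SIZE


def locate_block(rel_path, block_no):
    """Return (file path relative to the data directory, byte offset)."""
    seg_no = block_no // BLOCKS_PER_SEGMENT
    offset = (block_no % BLOCKS_PER_SEGMENT) * BLOCK_SIZE
    # The first segment has no suffix; later ones are named <path>.1, <path>.2, ...
    file_path = rel_path if seg_no == 0 else "%s.%d" % (rel_path, seg_no)
    return file_path, offset


print(locate_block("base/16384/464832386", 15962))
```

Under those assumptions the page sits in the first segment file, base/16384/464832386, at byte offset 130760704.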
