Re: root cause of corruption in hot standby

From: Mike Broers <mbroers(at)gmail(dot)com>
To: rui(at)crazybean(dot)net
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: root cause of corruption in hot standby
Date: 2018-10-10 14:15:16
Message-ID: CAB9893iwovZv1uVNuyn=dz7jRHjsY8tkhRQTvAjaGqhLzzH_8Q@mail.gmail.com
Lists: pgsql-admin

The replica is instantiated with a pg_basebackup, and seems to run fine for
a few days before the checksum error presents itself. Initially it ran a
few months without issue. The replica VM was created in May and it ran
until September without the checksum error. This time it was 12 days after
a fresh pg_basebackup.

I'll look into rsync checksums, but this corruption presented itself during
a time when streaming replication was working fine and it wasn't restoring
archived rsynced transaction logs, and hadn't done so for around 30 hours.
The table it complained about is accessed every minute with updates and
monitoring, so I don't think it would have taken so long if it were due to
the application of a corrupted WAL.

I'd like to know if there are diagnostics I can use to validate the VM and
its configuration. Checking the usual logging in /var/log and dmesg isn't
showing anything, and neither is chkdsk.
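
For the daily WAL comparison suggested in the quoted message below, here is
a minimal Python sketch of the same idea as rsync --checksum --dry-run: hash
every file in the source WAL directory, hash the same-named file on the
replica, and report differences without touching anything. It assumes both
directories are reachable from one host (for example over a mount), which
may not match the real setup. The names file_digest and compare_wal_dirs
are hypothetical, not part of any PostgreSQL or rsync tooling.

```python
import hashlib
import sys
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_wal_dirs(source_dir, replica_dir):
    """Compare same-named files in two directories without modifying them.

    Returns (mismatched, missing): names whose contents differ, and names
    present in source_dir but absent from replica_dir.
    """
    source, replica = Path(source_dir), Path(replica_dir)
    mismatched, missing = [], []
    for src in sorted(p for p in source.iterdir() if p.is_file()):
        dst = replica / src.name
        if not dst.is_file():
            missing.append(src.name)
        elif file_digest(src) != file_digest(dst):
            mismatched.append(src.name)
    return mismatched, missing


if __name__ == "__main__":
    # Usage (paths are placeholders): script.py {source_wals} {replica_wals}
    bad, absent = compare_wal_dirs(sys.argv[1], sys.argv[2])
    for name in bad:
        print(f"CONTENT DIFFERS: {name}")
    for name in absent:
        print(f"MISSING ON REPLICA: {name}")
```

Like the dry run, this only reads files, so it is safe to repeat every day.
A name reported as differing on a file that finished transferring long ago
would point at corruption on one side rather than at an in-flight copy.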

On Tue, Oct 9, 2018 at 7:56 PM Rui DeSousa <rui(at)crazybean(dot)net> wrote:

>
> > On Oct 9, 2018, at 1:04 PM, Mike Broers <mbroers(at)gmail(dot)com> wrote:
> >
> > Ok so I have checksum errors in this replica AGAIN.
>
> Mike,
>
> I don’t think you are dealing with a “Postgres” issue but possibly bit rot
> from either faulty hardware or a misconfiguration in your stack.
>
> If you recall, the archive WAL file was originally corrupted. Replicating
> the WAL files is outside the functionality of Postgres, thus it would
> either be a file replication issue, bit rot, or some other data corruption
> issue, but not a Postgres bug.
>
> This leaves me with the following two points:
>
> 1. How was the replica instance instantiated? I would assume from your
> backup procedures as your backups should be used to help validate them.
> 2. Are there currently any WAL files that are corrupt? You can quickly
> check using rsync with the "--checksum" option but don’t fix the file on the
> target but instead use "--dry-run" just to identify which files might have
> changed first. I would check this every day until the issue is fully
> resolved.
>
> i.e. rsync --archive --checksum --verbose --dry-run {source_wals}
> {replica_wals}
>
> Since you’re confident that you resolved the potential rsync race
> condition in archiving the WAL files we shouldn’t see any differences
> between WALs that have already been transmitted. If we do find WALs that
> are different then you’re dealing with data corruption on the replica and
> need to start looking into your stack and storage system; however, if you
> don’t find any corrupted WALs then question 1 needs to be scrutinized and
> you really need to ensure your backups are rock solid.
>
> I wouldn’t bother rebuilding the VM instance until the problem is
> identified — unless you’re moving it to an all new hardware stack.
>
> P.S. Is there any anti-virus software running on the server or any
> other software that might modify files on your behalf?
>
>
>
