Re: root cause of corruption in hot standby

From: Rui DeSousa <rui(at)crazybean(dot)net>
To: Mike Broers <mbroers(at)gmail(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: root cause of corruption in hot standby
Date: 2018-10-10 00:56:13
Message-ID: 8303E556-079F-440A-BE2A-00A0E92EB2C7@crazybean.net
Lists: pgsql-admin


> On Oct 9, 2018, at 1:04 PM, Mike Broers <mbroers(at)gmail(dot)com> wrote:
>
> Ok so I have checksum errors in this replica AGAIN.

Mike,

I don’t think you are dealing with a “Postgres” issue but possibly bit rot from either faulty hardware or a misconfiguration in your stack.

If you recall, the archived WAL file was originally corrupted. Replicating the WAL files is outside the functionality of Postgres, so this would be a file replication issue, bit rot, or some other data corruption issue, but not a Postgres bug.

This leaves me with the following two questions:

1. How was the replica instance instantiated? I would assume from your backups, since building replicas from your backups is also a good way to validate them.
2. Are there currently any WAL files that are corrupt? You can quickly check using rsync with the "--checksum" option. Don’t fix the files on the target yet; use "--dry-run" first, just to identify which files might have changed. I would check this every day until the issue is fully resolved.

i.e. rsync --archive --checksum --verbose --dry-run {source_wals} {replica_wals}

Since you’re confident that you resolved the potential rsync race condition in archiving the WAL files, we shouldn’t see any differences between WALs that have already been transmitted. If we do find WALs that are different, then you’re dealing with data corruption on the replica and need to start looking into your stack and storage system. However, if you don’t find any corrupted WALs, then question 1 needs to be scrutinized and you really need to ensure your backups are rock solid.
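If both WAL directories happen to be reachable from one host (that is an assumption on my part, e.g. via a mount), the daily check could also be sketched in plain shell without rsync. The function name find_mismatched_wals and the paths in the usage line are hypothetical. It only compares files that exist on both sides, so WALs that simply have not been shipped yet are not reported.

```shell
#!/bin/sh
# Hypothetical helper: print the names of WAL files that exist in both
# directories but whose contents differ. Files missing from the replica
# directory are skipped, since they may just not have been shipped yet.
find_mismatched_wals() {
    src_dir=$1
    dst_dir=$2
    for src in "$src_dir"/*; do
        [ -f "$src" ] || continue
        name=${src##*/}
        dst=$dst_dir/$name
        [ -f "$dst" ] || continue
        # cmp -s prints nothing and exits non-zero when the bytes differ
        if ! cmp -s "$src" "$dst"; then
            printf '%s\n' "$name"
        fi
    done
}
```

Usage would be something like: find_mismatched_wals /path/to/source_wals /path/to/replica_wals. Any file name it prints is a WAL that changed after it was transmitted, which is exactly the case that points at the replica's stack or storage.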

I wouldn’t bother rebuilding the VM instance until the problem is identified, unless you’re moving it to an all-new hardware stack.

P.S. Is there any anti-virus software running on the server, or any other software that might modify files on your behalf?
