Re: root cause of corruption in hot standby

From: Mike Broers <mbroers(at)gmail(dot)com>
To: rui(at)crazybean(dot)net
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: root cause of corruption in hot standby
Date: 2018-10-10 18:30:22
Message-ID: CAB9893i7K7uYQBWd8zQzKpQwRqwOXZxUXzsfTgBOA5jgTzHSUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Agreed, no good threads to pull for definite answers. Ive added a checksum
to the rsync as a precaution. The SAN is the same between the two hosts.
One is running on a VMWare server, the other is on a Hyper-V host. We have
a few other postgres database hosts on that same hyper-v infrastructure and
they all check out fine as far as I can tell.

The only thing I could guess at this point is that although primary and
replica are running on ext4 under LVM, the replica data volumes were grown
online at some point because we were intending to cutover into an
environment with more space. This happened a while back and doesnt
correspond to any of the times the checksum errors occurred in the logs,
and Ive started with fresh pg_basebackups twice since then, so Im skeptical
that is a cause, but its all I've got.

There may be some additional hyper-v fsync write-thru cache configurations
I'm unaware of as well. That is my fear at this point.

Thanks for following up again.

On Wed, Oct 10, 2018 at 10:22 AM Rui DeSousa <rui(at)crazybean(dot)net> wrote:

>
> > On Oct 10, 2018, at 10:15 AM, Mike Broers <mbroers(at)gmail(dot)com> wrote:
> >
> >
> > I'll look into rsync checksums, but this corruption presented itself
> during a time when streaming replication was working fine and it wasnt
> restoring archived rsynced transaction logs, and hadnt done so for around
> 30 hours. The table it complained about it is accessed every minute with
> updates and monitoring so I dont think it would have taken so long if it
> was due to the application of a corrupted wal.
> >
>
> I think you missed my point. If you are dealing with some sort of bit rot
> and/or data corruption on your storage device you need to sort of prove it
> which is very difficult to do.
>
> You have WAL files on primary and the same WAL files on the replica via
> your rsync copy job. If you check and recheck all the WALs daily to see if
> any of the files are changing and find a difference than proves that there
> is some sort of corruption/bit rot occurring as the WAL files are static
> files.
>
> I’ve seen this type of corruption before with RAID controllers that are
> over taxed; where they would corrupt over time periodically. I ended up
> changing from a RAID configuration to a JBOD and managing the disks via ZFS
> instead and never again experience data corruption using the exact same
> hardware. ZFS also detects bit rot and correct for it as well as the
> ability to scrub the pool to ensure the disks are not slowly rotting away.
>
> What storage system is being used? Does it have any measures to prevent
> bit rot? What is the RAID configuration? I would not recommend RAID 5 for a
> database; under heavy load the performance degradation and increase
> likelihood of data corruption is not worth it.
>
> It sounds like you have some sort of environmental issues which is
> corrupting your data and it is not a Postgres issue. The problem you face
> is that without some sort of definitive poof you’ll enter the realm of
> finger pointing… it’s a database issue, no is storage issue, etc.
>
> You have two replicas; one periodically fails and the other does not — the
> only difference is the environment in which they operate.

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Andres Freund 2018-10-10 19:25:53 Re: root cause of corruption in hot standby
Previous Message Rui DeSousa 2018-10-10 15:22:31 Re: root cause of corruption in hot standby