incorrect resource manager data checksum in record

From: Devin Christensen <quixoten(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: incorrect resource manager data checksum in record
Date: 2018-06-28 17:44:03
Message-ID: CANQ55Tsoa6=vk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw@mail.gmail.com
Lists: pgsql-general

I've been seeing the error "incorrect resource manager data checksum in
record" in multiple separate hot standby replication chains of PostgreSQL
servers (5 so far). There are 4 servers in each chain, some running Ubuntu
14.04 and others Ubuntu 16.04, with PostgreSQL >= 10.1 and <= 11. We also
have a mix of ext4 and ZFS file systems. Here are the details for each chain:

First chain
===========
dc1-pg105 (pg 10.1, ub 14.04.5) (primary)
|
V
dc1-pg205 (pg 10.3, ub 16.04.4)
|
V
dc2-pg105 (pg 10.1, ub 14.04.5) <-- error first occurs here
|
V
dc2-pg205 (pg 10.3, ub 16.04.4) <-- and also affects this node

Second chain
===========
dc1-pg106 (pg 10.1, ub 14.04.5, ext4) (primary)
|
V
dc1-pg206 (pg 10.3, ub 16.04.4, zfs)
|
V
dc2-pg106 (pg 10.1, ub 14.04.5, ext4) <-- error first occurs here
|
V
dc2-pg206 (pg 10.3, ub 16.04.4, zfs) <-- and also affects this node

Third chain
===========
dc1-pg107 (pg 10.1, ub 14.04.5, ext4) (primary)
|
V
dc1-pg207 (pg 10.3, ub 16.04.4, zfs)
|
V
dc2-pg107 (pg 10.1, ub 14.04.5, ext4) <-- error first occurs here
|
V
dc2-pg207 (pg 10.3, ub 16.04.4, zfs) <-- and also affects this node

Fourth chain
===========
dc1-pg108 (pg 10.3, ub 16.04.4, ext4) (primary)
|
V
dc1-pg208 (pg 10.3, ub 16.04.4, zfs)
|
V
dc2-pg108 (pg 10.3, ub 16.04.4, ext4) <-- error first occurs here
|
V
dc2-pg208 (pg 10.3, ub 16.04.4, zfs) <-- and also affects this node

Fifth chain
===========
dc1-pg110 (pg 10.3, ub 16.04.4, ext4) (primary)
|
V
dc1-pg210 (pg 10.3, ub 16.04.4, zfs)
|
V
dc2-pg110 (pg 10.3, ub 16.04.4, ext4) <-- error first occurs here
|
V
dc2-pg210 (pg 10.3, ub 16.04.4, zfs) <-- and also affects this node

The pattern is the same regardless of Ubuntu or PostgreSQL version. I'm
concerned this is somehow a ZFS corruption bug, because the error always
occurs downstream of the first ZFS node and ZFS is a recent addition. I
don't know enough about what this error means, and I haven't found much
online. When I restart the affected nodes, replication resumes normally,
with no side effects that I've discovered so far, but I'm no longer
confident that the data downstream from the primary is valid. I'm not
sure how best to start tackling this issue, and I'm hoping to get some
guidance. The error is infrequent: we have 11 replication chains in total,
and this error has occurred on 5 of them in approximately 2 months.
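
For reference, the "always downstream of the first ZFS node" observation can
be checked mechanically against the listing above. The following is only a
minimal sketch: the CHAINS table is transcribed by hand from chains two
through five (the first chain does not list file systems, so it is left
out), and the function names are made up for illustration.

```python
# Chains two to five, transcribed from the listing above.
# Each node is (hostname, file system, error observed on this node).
# The first chain is omitted because its file systems are not listed.
CHAINS = {
    "second": [("dc1-pg106", "ext4", False), ("dc1-pg206", "zfs", False),
               ("dc2-pg106", "ext4", True), ("dc2-pg206", "zfs", True)],
    "third": [("dc1-pg107", "ext4", False), ("dc1-pg207", "zfs", False),
              ("dc2-pg107", "ext4", True), ("dc2-pg207", "zfs", True)],
    "fourth": [("dc1-pg108", "ext4", False), ("dc1-pg208", "zfs", False),
               ("dc2-pg108", "ext4", True), ("dc2-pg208", "zfs", True)],
    "fifth": [("dc1-pg110", "ext4", False), ("dc1-pg210", "zfs", False),
              ("dc2-pg110", "ext4", True), ("dc2-pg210", "zfs", True)],
}


def first_error_node(chain):
    """Return the hostname of the most upstream node showing the error."""
    return next((host for host, _, err in chain if err), None)


def errors_only_below_first_zfs(chain):
    """True if every erroring node is strictly downstream of the first ZFS node."""
    file_systems = [fs for _, fs, _ in chain]
    if "zfs" not in file_systems:
        # No ZFS node at all: the claim only holds if there are no errors.
        return not any(err for _, _, err in chain)
    first_zfs = file_systems.index("zfs")
    return all(i > first_zfs for i, (_, _, err) in enumerate(chain) if err)


def first_error_is_first_dc2(chain):
    """True if the first erroring node is also the first host named dc2-*."""
    first_dc2 = next((host for host, _, _ in chain if host.startswith("dc2")), None)
    return first_error_node(chain) == first_dc2


for name, chain in CHAINS.items():
    print(name, first_error_node(chain),
          errors_only_below_first_zfs(chain),
          first_error_is_first_dc2(chain))
```

Both checks hold for every chain in the table, so the listing alone cannot
tell "downstream of the first ZFS node" apart from "first node with a dc2
hostname"; it only shows that the two coincide in these chains.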
