Re: Funny WAL corruption issue

From: Vladimir Rusinov <vrusinov(at)google(dot)com>
To: Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>
Cc: Chris Travers <chris(dot)travers(at)gmail(dot)com>, Vladimir Borodin <root(at)simply(dot)name>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Funny WAL corruption issue
Date: 2017-08-10 13:17:59
Message-ID: CAE1wr-wO_r3fGp98mu+A7THvWJLa+2f7iXzaxp5YQ7S-oe1JAg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru> wrote:

> I just wanted to point out that a hardware issue or third party software
> issues (bugs in FS, software RAID, ...) could not be fully excluded from
> the list of suspects. According to the talk by Christophe Pettus [1]
> it's not that uncommon as most people think.

This could still be a case of hardware corruption, but it does not look
like one.

The likelihood of two different people seeing a similar error message just
a year apart is low. In our experience, hardware corruption usually looks
like a random single bit flip (most commonly a bad CPU or memory), a run of
zeroes (bad storage), or a run of complete garbage (usually a sign of
in-memory pointer corruption).

Chris, if you still have the original WAL segment from the master and its
corrupt copy from the standby, can you do a bit-by-bit comparison to see how
they differ? Also, if you can, please share some hardware details.
Specifically, do you use ECC? If so, are there any ECC errors logged? Do
you use physical disks/SSDs or some form of storage virtualization?
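
For the comparison itself, something along these lines might be enough. This
is only a rough Python sketch, not a tested tool: the function names
(diff_runs, classify, compare_files) are made up for illustration, and it
assumes both copies of the segment have the same length and fit in memory.
It reports each contiguous run of differing bytes and makes a crude guess at
which of the patterns above the run resembles.

```python
from collections import namedtuple

# One contiguous run of bytes that differ between the two copies.
DiffRun = namedtuple("DiffRun", "offset length bits_flipped all_zero_in_copy")


def diff_runs(original: bytes, copy: bytes):
    """Return contiguous runs of differing bytes between two equal-length buffers."""
    if len(original) != len(copy):
        raise ValueError("segments differ in length")
    runs = []
    start = None
    # Iterate one position past the end so a run touching the last byte is closed.
    for i in range(len(original) + 1):
        differs = i < len(original) and original[i] != copy[i]
        if differs and start is None:
            start = i
        elif not differs and start is not None:
            a, b = original[start:i], copy[start:i]
            # Count differing bits across the whole run.
            bits = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
            runs.append(DiffRun(start, i - start, bits, not any(b)))
            start = None
    return runs


def classify(run):
    """Crude guess at the corruption pattern (a heuristic, not a diagnosis)."""
    if run.length == 1 and run.bits_flipped == 1:
        return "single bit flip"
    if run.all_zero_in_copy:
        return "zeroed range"
    return "other"


def compare_files(path_original, path_copy):
    """Read both files fully and return their differing runs."""
    with open(path_original, "rb") as fa, open(path_copy, "rb") as fb:
        return diff_runs(fa.read(), fb.read())
```

Calling compare_files() on the two segment files and printing each run's
offset, length and classify() result should show quickly whether the damage
is a lone bit, a block of zeroes, or something less regular.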

Also, in the vast majority of cases corruption is caught by checksums. I am
not familiar with the WAL protocol: do we have enough checksums when writing
it out and on the wire? I suspect there are many more things PostgreSQL can
do to be more resilient, or at least to detect corruption earlier.

--
Vladimir Rusinov
PostgreSQL SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047