Re: Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: 'Mike Broers' <mbroers(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Re: corruption issue after server crash - ERROR: unexpected chunk number 0
Date: 2013-11-26 20:20:00
Message-ID: 0683F5F5A5C7FE419A752A034B4A0B97975AAAED@sswchi5pmbx2.peak6.net
Lists: pgsql-general

> So if this problem replicated to our standby servers does that indicate
> that the potential problematic fsync occurred during a pg_xlog write?

Pretty much. You have a couple of issues here, and no easy way to approach them. Primarily, you got data corruption during a sync operation. That means either the OS or the hardware somewhere along the line lied about the write, or the write was corrupted and the filesystem log replayed incorrectly on reboot. Once that happens, you can't trust *any* data in your database. Pre-checksum PostgreSQL has no way to verify the integrity of existing data, and a system crash can corrupt quite a bit of data that was only tangentially involved.
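If you end up rebuilding the cluster anyway, that's the time to turn on data checksums so page-level damage is at least detectable later. A minimal sketch, assuming a 9.3+ install and a placeholder data directory (on this release checksums can only be enabled at initdb time):

    # Create the new cluster with page checksums enabled
    initdb --data-checksums -D /var/lib/pgsql/9.3/data

    # Confirm the setting on the running cluster
    psql -c "SHOW data_checksums;"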

What likely happens in these scenarios is that the database starts up successfully and then reads some rows from a corrupted table. By corrupted, I mean even a single data page with a mangled pointer. That mangled pointer gave the database incorrect information about the contents of that page, and the database carried on using that information. That means the transaction logs from that point forward are *also* corrupt, and hence any streaming or warm standby replicas are damaged as well. But they'll be damaged differently, because they likely didn't have the initial corruption, just the byte changes dictated by the WAL stream.
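One way to gauge how far the damage spread is to force a full read of every table on both the master and the replicas; COPY detoasts every column, so it will trip over exactly the kind of "unexpected chunk number" error in your subject line. A rough sketch only (it won't catch index corruption), with a placeholder database name:

    # Force a full read, including TOAST data, of every user table
    for tbl in $(psql -AtX -d mydb -c \
      "SELECT quote_ident(schemaname) || '.' || quote_ident(tablename)
         FROM pg_tables
        WHERE schemaname NOT IN ('pg_catalog', 'information_schema')")
    do
      echo "checking $tbl"
      psql -d mydb -c "COPY $tbl TO STDOUT" > /dev/null || echo "FAILED: $tbl"
    done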

Unless you know where the initial corruption came from, the system that caused it should be quarantined for verification. RAM, disk, CPU, everything should pass integrity checks before putting it back into production.
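The specific tools depend on your platform, but as a rough illustration of the kind of checks I mean (nothing here is specific to your setup, and the device names are placeholders):

    # Disk: SMART long self-test, then review the results
    smartctl -t long /dev/sda
    smartctl -a /dev/sda

    # Disk surface: read-only scan (slow)
    badblocks -sv /dev/sda

    # RAM: userspace test, or boot memtest86+ for a full pass
    memtester 4G 3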

> Would breaking replication at the time of the crash have prevented
> this from cascading or was it already too late at that point?

Most likely. If, at the time of the crash, you had switched to one of your replicas and made it the new master, you would have had the opportunity to check out the crashed system before it spread the love. Even if you don't have a true STONITH model, starting up a potentially data-compromised node in an active cluster is a gamble.
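For reference, promoting a streaming replica is a one-liner on 9.1 and later; the data directory below is a placeholder, and you'd also want to fence the crashed node so it can't rejoin by accident:

    # On the standby you're keeping:
    pg_ctl promote -D /var/lib/pgsql/9.3/data
    # (or touch the trigger_file named in its recovery.conf)

    # On the crashed node: keep postgres stopped until the hardware is verified.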

I did something similar once. One of our DRBD nodes crashed, came back up, and re-attached to the DRBD pair after a quick data discard and replay. I continued with some scheduled system maintenance and performed a node failover with no incident. It wasn't until 20 minutes later that the corrupt disk pages started making their presence felt, and by then it was too late. Luckily we were still verifying, but with our secondaries ruined, we had to restore from backup. A 30-minute outage became a 4-hour one.

Afterwards, we put in a new policy that any crash means a DRBD verify at minimum, and until the node passes, it is to be considered invalid and unusable. If you haven't already, I suggest something similar for your setup. Verify a crashed node before using it again, no matter how much pressure you're under. It can always get worse.
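In DRBD terms that check is just an online verify. A sketch, assuming a resource named r0 and a verify-alg configured in its net section:

    # Kick off an online verification of the resource
    drbdadm verify r0

    # Watch progress in /proc/drbd; out-of-sync blocks show up in the kernel log
    cat /proc/drbd
    dmesg | grep -i 'out of sync'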

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas(at)optionshouse(dot)com

