Re: Funny WAL corruption issue

From: Chris Travers <chris(dot)travers(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Vladimir Rusinov <vrusinov(at)google(dot)com>, Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>, Vladimir Borodin <root(at)simply(dot)name>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Funny WAL corruption issue
Date: 2017-08-11 12:53:36
Message-ID: CAKt_ZfuCDb_yCDVfXafSM5bQDY56FPPLDfdipZAgPFFq8xkCag@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark <stark(at)mit(dot)edu> wrote:

> On 10 August 2017 at 15:26, Chris Travers <chris(dot)travers(at)gmail(dot)com> wrote:
> >
> >
> > The bitwise comparison is interesting. Remember the error was:
> >
> > pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected
> > pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
> > 1146880
> ...
> > Since this didn't throw a checksum error (we have data checksums
> > disabled but wal records ISTR have a separate CRC check), would this
> > perhaps indicate that the checksum operated over incorrect data?
>
> No checksum error and this "unexpected pageaddr" doesn't necessarily
> mean data corruption. It could mean that when the database stopped logging
> it was reusing a wal file and the old wal stream had a record boundary
> on the same byte position. So the previous record checksum passed and
> the following record checksum passes but the record header is for a
> different wal stream position.
>

I expect to test this theory shortly.
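
As a first sanity check, the numbers in the error message can be decoded by
hand. The sketch below assumes the default 16 MB segment size and 8 kB WAL
page size (I have not confirmed either for this cluster), and the helper
names are made up for illustration; they are not PostgreSQL functions.

```python
# Rough sketch: decode the values from the pg_xlogdump error message.
# Assumes default build settings (16 MB WAL segments, 8 kB WAL pages).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
XLOG_BLCKSZ = 8192
SEGS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE


def parse_lsn(text):
    """Turn 'HI/LO' (both hex) into a 64-bit integer WAL position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn):
    """Inverse of parse_lsn."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def segment_start(name):
    """WAL position of the first byte of a 24-character segment file name."""
    log, seg = int(name[8:16], 16), int(name[16:24], 16)
    return (log * SEGS_PER_XLOGID + seg) * WAL_SEGMENT_SIZE


def segment_name(lsn, timeline):
    """Segment file name containing the given WAL position."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline, segno // SEGS_PER_XLOGID, segno % SEGS_PER_XLOGID)


# Values copied from the error message.
segment = "000000000001E39C000000E1"
offset = 1146880
record = parse_lsn("1E39C/E1117FB8")
found = parse_lsn("1E375/61118000")

# Page address the reader expected at that offset in the segment.
expected = format_lsn(segment_start(segment) + offset)

# Start of the WAL page that follows the one holding the failing record.
next_page = format_lsn((record // XLOG_BLCKSZ + 1) * XLOG_BLCKSZ)

# Where the page address actually found would live.
found_offset = found % WAL_SEGMENT_SIZE
old_segment = segment_name(found, int(segment[:8], 16))

print("expected pageaddr:", expected)
print("next page after record:", next_page)
print("found pageaddr offset in segment:", found_offset)
print("segment the found pageaddr belongs to:", old_segment)
```

Under those assumptions the expected page address is 1E39C/E1118000, which
is the page immediately after the one containing the failing record. The
page address actually found sits at the same offset (1146880) within its
own segment, which would have been 000000000001E37500000061. That is what
one would see if the file were a recycled segment whose page at that offset
had not been overwritten yet, though it does not prove it.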

Assuming it is correct, what can we do to prevent restarts of slaves from
running into it?

> I think you could actually hack xlogdump to ignore this condition and
> keep outputting and you'll see whether the records that follow appear
> to be old wal log data. I haven't actually tried this though.
>
> --
> greg
>

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more
