From: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
---|---|
To: | Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> |
Cc: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Oskari Saarenmaa <os(at)ohmu(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: emergency outage requiring database restart |
Date: | 2017-01-18 14:33:50 |
Message-ID: | CAHyXU0ypCaDJMJ78H6EdKztZeh5oEkGu+j5HpwmfzOpWB4q1zg@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> Still getting checksum failures. Over the last 30 days, I see the
>> following. Since enabling checksums FWICT none of the damage is
>> permanent and rolls back with the transaction. So creepy!
>
> The checksums still only differ in least significant digits which
> pretty much means that there is a block number mismatch. So if you
> rule out filesystem not doing its job correctly and transposing
> blocks, it could be something else that is resulting in blocks getting
> read from a location that happens to differ by a small multiple of
> page size. Maybe somebody is racily mucking with table fd's between
> seeking and reading. That would explain the issue disappearing after a
> retry.
>
> Maybe you can arrange for the RelFileNode and block number to be
> logged for the checksum failures and check what the actual checksums
> are in data files surrounding the failed page. If the requested block
> number contains something completely else, but the page that follows
> contains the expected checksum value, then it would support this
> theory.
Will do. The main challenge is swapping in a hand-compiled server so
that libdir continues to work. Getting access to the server is
difficult, as is getting a maintenance window. I'll post back ASAP.
merlin
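
For reference, here is a rough sketch of the check Ants describes. It is
not code from this thread. It assumes the default 8192-byte page size, a
little-endian machine, and that pd_checksum is the 16-bit field right
after the 8-byte pd_lsn at the start of the page header. The names
mix_block_number and stored_checksums are made up for illustration.
The first function only mirrors the last step of the page checksum (XOR
in the block number, then reduce to 16 bits with an offset of one), which
is why the same page checked against a nearby block number usually
differs only in the low digits. The second reads the stored checksums of
the pages around a failed block so they can be compared with the value
reported in the warning.

```python
import struct

BLCKSZ = 8192            # assumed: default PostgreSQL page size
PD_CHECKSUM_OFFSET = 8   # assumed: pd_checksum follows the 8-byte pd_lsn


def mix_block_number(page_hash: int, blkno: int) -> int:
    """Mirror the final checksum step: XOR in the block number, then
    reduce to the range 1..65535. page_hash stands in for the 32-bit
    hash of the page contents, which is not computed here."""
    return ((page_hash ^ blkno) & 0xFFFFFFFF) % 65535 + 1


def stored_checksums(path: str, blkno: int, radius: int = 2) -> dict:
    """Return {block number: stored pd_checksum} for the pages around
    blkno. Block numbers are relative to the start of the file at path,
    so for a multi-segment relation the caller has to pick the right
    segment file and offset."""
    result = {}
    with open(path, "rb") as f:
        for n in range(max(0, blkno - radius), blkno + radius + 1):
            f.seek(n * BLCKSZ)
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break  # ran past the end of the file
            # Little-endian uint16 is an assumption (x86 hardware).
            (checksum,) = struct.unpack_from("<H", page, PD_CHECKSUM_OFFSET)
            result[n] = checksum
    return result
```

If the checksum that the server expected for the failed block shows up
as the stored value of a neighbouring page, that would point toward the
wrong block having been read, rather than the page contents being
damaged. This should only be run against a copy of the data file or
with the server stopped.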