Re: Standby corruption after master is restarted

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: emre(at)hasegeli(dot)com
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>, gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com
Subject: Re: Standby corruption after master is restarted
Date: 2018-04-16 19:31:08
Message-ID: 30d65a09-417a-530c-d70f-7b01ed66ebc1@2ndquadrant.com
Lists: pgsql-bugs pgsql-hackers

On 04/16/2018 04:55 PM, Emre Hasegeli wrote:
>> OK, and on the standby? I suppose it breaks right after the
>> CHECKPOINT_SHUTDOWN record, right?
>
> Actually pg_xlogdump fails right at the SHUTDOWN record:
>
>> rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 1DFB/D6BDDEA0, prev 1DFB/D6BDDE60, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331538
>> rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 1DFB/D6BDDEE0, prev 1DFB/D6BDDEA0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331539
>> rmgr: Heap2 len (rec/tot): 110/ 110, tx: 0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
>> rmgr: Heap2 len (rec/tot): 100/ 100, tx: 0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
>> pg_xlogdump: FATAL: error in WAL record at 1DFB/D6BDDF90: unexpected pageaddr 1DFB/66BDE000 in log segment 0000000000001DFB000000D6, offset 12443648
>
>> Hmmm. I'm not sure what exactly the debian upgrade script does (and
>> IIRC it depends on the packages). I wonder if it does fast/smart/immediate.
>
> I couldn't find it quickly.
>
>> And can you determine how the broken WAL got to the standby? Was it by
>> directly fetching from the primary, or from the barman archive? Is it
>> broken in the archive too?
>
> It must have been streamed from the primary. The log was not on the
> archive during the incident. The good log file came there later. The
> following logs are repeated on the standby before the error:
>

OK, so the standby connects to the primary, gets a (corrupted) WAL
segment from it, which then gets archived? That would contradict the
theory of a race condition between recycling the WAL segment and
copying it to the standby, because we never recycle segments that have
not been archived yet (obviously).

But I'm not sure that's the exact sequence of events, of course.

Can you check if the "incorrect" part of the WAL segment matches some
previous segment? Verifying that shouldn't be very difficult (just cut a
bunch of bytes using hexdump, compare to the incorrect data). Assuming
you still have the WAL archive, of course. That would tell us that the
corrupted part comes from an old recycled segment.
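
In case it helps, here is a rough, untested sketch of that comparison.
The paths and file names are only examples; the offset is the one from
the pg_xlogdump error, and the unexpected pageaddr 1DFB/66BDE000 makes
segment ...00000066 from the archive the natural candidate to compare
against:

# Untested sketch: compare the suspect part of the corrupted segment on
# the standby with the same byte range of an older segment from the
# archive. 12443648 (0xBDE000) is the offset reported by pg_xlogdump.
OFFSET = 12443648
LENGTH = 1024 * 1024   # compare 1MB to start with

def read_chunk(path, offset, length):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

bad = read_chunk("standby/pg_xlog/0000000500001DFB000000D6", OFFSET, LENGTH)
old = read_chunk("archive/0000000500001DFB00000066", OFFSET, LENGTH)
print("match" if bad == old else "no match")

If the two ranges are identical, we know the corrupted tail is simply
the leftover content of a recycled segment.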

>> 2018-03-28 14:00:30 UTC [3693-67] LOG: invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG: started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG: restartpoint starting: time
>> WARNING: terminating connection because of crash of another server process
>> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> HINT: In a moment you should be able to reconnect to the database and repeat your command.
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>

Hmmm, I see you're using SSL. I don't think that could affect
anything, but maybe I should try mimicking this aspect too.

>> I think one hypothesis is that the standby starts fetching a WAL segment
>> from the primary, and while doing that it gets recycled and overwritten
>> (so the standby gets bogus somewhere half-way through). That should not
>> be possible, of course, but perhaps there's something broken right after
>> a restart? A race condition or something like that?
>
> If I haven't missed anything in the hexdumps I posted, the logs of the
> master and the standby are the same up to the position reported by
> pg_controldata. I cannot understand how it can fail for something
> later. Is it possible that this is triggered if the replication is
> interrupted at a certain point?
>

Which position? pg_controldata shows the location of the last
checkpoint, which says where replay should start. Or am I missing
something?
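
To be explicit about which fields I mean, something along these lines
would show it (the data directory path below is only an example):

import subprocess

# Example only: print the checkpoint fields that pg_controldata reports
# for a data directory; recovery replays WAL starting from the REDO location.
out = subprocess.run(["pg_controldata", "/var/lib/postgresql/9.6/main"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    if line.startswith(("Latest checkpoint location",
                        "Latest checkpoint's REDO location")):
        print(line)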

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
