Re: Standby corruption after master is restarted

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: emre(at)hasegeli(dot)com
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>, gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com
Subject: Re: Standby corruption after master is restarted
Date: 2018-04-17 12:11:21
Message-ID: 1da55c73-4bd1-f13e-2d4b-c4049ffd73f5@2ndquadrant.com
Lists: pgsql-bugs pgsql-hackers

On 04/17/2018 10:55 AM, Emre Hasegeli wrote:
>> Can you check if the "incorrect" part of the WAL segment matches some
>> previous segment? Verifying that shouldn't be very difficult (just cut a
>> bunch of bytes using hexdump, compare to the incorrect data). Assuming
>> you still have the WAL archive, of course. That would tell us that the
>> corrupted part comes from an old recycled segment.
>
> I had found and saved the recycled WAL file from the archive after the
> incident. Here is the hexdump of it at the same position:
>
> 0bddfc0 3253 4830 616f 5034 5243 4d79 664f 6164
> 0bddfd0 3967 592d 7963 7967 5541 4a59 3066 4f50
> 0bddfe0 2d55 346e 4254 3559 6a4e 726b 4e30 6f52
> 0bddff0 3876 4751 4a38 5956 5f32 7234 4b55 7045
> 0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
> 0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
> 0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
> 0bde030 0765 7375 6111 646e 6f72 6469 370d 312e
>
> If you compare it with the other 2 I have posted, you will notice
> that the corrupted file on standby is a combination of the two. The
> data on it starts with the data on the master, and continues with the
> data of the recycled file. The switch is at the position 0bddff8
> which is the position printed as "Minimum recovery ending location" by
> pg_controldata.
>

OK, this seems to confirm the theory that there's a race condition
between segment recycling and replication. It's likely limited to a short
period after a crash, otherwise we'd probably see many more reports.

But it's still just a hunch - someone needs to read through the code and
check how it behaves in these situations. Not sure when I'll have time
for that.
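
For anyone who wants to repeat the comparison on their own segments, here
is a minimal sketch of the check described above. It assumes you have
three copies of the same WAL segment position as raw bytes: the one from
the standby, the one from the master, and the old recycled one from the
archive. The function name and the way the data is loaded are made up for
illustration, they are not part of any PostgreSQL tool.

```python
from typing import Optional, Tuple


def find_switch_offset(
    standby: bytes, master: bytes, recycled: bytes
) -> Tuple[Optional[int], bool]:
    """Locate where the standby copy stops matching the master copy.

    Returns (offset, tail_matches_recycled). offset is None when the
    standby matches the master over the whole compared length.
    Hypothetical helper, for illustration only.
    """
    # Only compare the range that all three buffers cover.
    n = min(len(standby), len(master), len(recycled))
    for off in range(n):
        if standby[off] != master[off]:
            # From the first mismatch on, check whether the standby
            # carries the bytes of the old (recycled) segment instead.
            return off, standby[off:n] == recycled[off:n]
    return None, False
```

If the second value is true, the standby segment is the master's data up
to the returned offset followed by stale data from the recycled file. Note
that the reported offset can be slightly later than the real switch if the
master and the recycled file happen to contain identical bytes at that
position.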

>> Hmmm, I see you're using SSL. I don't think that could affect
>> anything, but maybe I should try mimicking this aspect too.
>
> This is the connection information. However, the master shows SSL
> compression as disabled despite it being explicitly requested.
>
>> primary_conninfo = 'host=MASTER_NODE port=5432 dbname=repmgr user=repmgr connect_timeout=10 sslcompression=1'

Hmmm, that seems like a separate issue. When you say 'master shows SSL
compression is disabled', where do you see that?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
