Re: streaming replication - crash on standby

From: "Seong Son (US)" <Seong(dot)Son(at)datapath(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: streaming replication - crash on standby
Date: 2017-08-11 21:56:49
Message-ID: BY2PR17MB0328763A8900085A498A819684890@BY2PR17MB0328.namprd17.prod.outlook.com
Lists: pgsql-general

>-----Original Message-----
>From: Andres Freund [mailto:andres(at)anarazel(dot)de]
>Sent: Wednesday, August 09, 2017 6:34 PM
>To: Seong Son (US) <Seong(dot)Son(at)datapath(dot)com>
>Cc: pgsql-general(at)postgresql(dot)org
>Subject: Re: [GENERAL] streaming replication - crash on standby
>
>Hi,
>
>Please quote properly on postgres mailing lists.
>
>On 2017-08-09 22:31:23 +0000, Seong Son (US) wrote:
>> I see. Thank you.
>>
>> But the Postgresql process had crashed at that time so the streaming replication was no longer working. Why would it crash and is that normal?
>
>You've given us absolutely zero information to be able to diagnose the problem. If you want somebody to help you you'll have to describe exactly what happened, and what the problem you're facing is.
>
>- Andres

Sorry for the lack of info. I've gathered some more, and hopefully it is enough to help isolate the cause of the standby server's crash.

The servers run Windows Server 2012 R2 and PostgreSQL 9.6. The primary and standby servers are in two different cities, connected over a VPN.

Here are the last few lines from pg_log at the time of the standby server's crash:

2017-08-08 21:17:56 UTC FATAL: invalid memory alloc request size 1656315904
2017-08-08 21:17:56 UTC LOG: startup process (PID 2972) exited with exit code 1
2017-08-08 21:17:56 UTC LOG: terminating any other active server processes
2017-08-08 21:17:56 UTC WARNING: terminating connection because of crash of another server process
2017-08-08 21:17:56 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-08-08 21:17:56 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-08-08 21:17:56 UTC WARNING: terminating connection because of crash of another server process
2017-08-08 21:17:56 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-08-08 21:17:56 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-08-08 21:17:56 UTC LOG: database system is shut down

And this is the last entry from pg_xlogdump:

-08 21:17:36.864852 Coordinated Universal Time
pg_xlogdump: FATAL: error in WAL record at DF/4CB95FD0: unexpected pageaddr DB/62B96000 in log segment 00000000000000DF0000004C, offset 12148736
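
For reference, here is a minimal Python sketch of how I read the numbers in that error. It assumes the default 16 MB WAL segment size and 8 kB WAL page size, which I have not verified for this build. The helper names are my own, and the timeline of 0 is used only to match the file name printed in the error.

```python
# Sketch: decode the LSNs from the pg_xlogdump error above.
# Assumptions (not verified for this cluster): 16 MB WAL segments, 8 kB WAL pages.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
WAL_PAGE_SIZE = 8192
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 with 16 MB segments


def parse_lsn(text):
    """Turn an LSN such as 'DF/4CB95FD0' into a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(lsn):
    """Format a 64-bit byte position back into the 'X/X' notation."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def segment_name(lsn, timeline=0):
    """WAL file name (timeline, log id, segment) that contains this LSN."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def offset_in_segment(lsn):
    """Byte offset of this LSN inside its segment file."""
    return lsn % WAL_SEGMENT_SIZE


def next_page_start(lsn):
    """LSN of the first byte of the WAL page following the one holding lsn."""
    return (lsn // WAL_PAGE_SIZE + 1) * WAL_PAGE_SIZE


record = parse_lsn("DF/4CB95FD0")        # record named in the error
found = parse_lsn("DB/62B96000")         # page address actually found
expected_page = next_page_start(record)  # page the reader moved on to

print(segment_name(record))              # segment holding the record
print(offset_in_segment(record))         # offset of the record itself
print(format_lsn(expected_page), offset_in_segment(expected_page))
print(offset_in_segment(found))          # in-segment offset of the found address
print(hex(1656315904))                   # alloc request size from pg_log, in hex
```

If those assumptions hold, the offset 12148736 in the error is the start of the page right after the failing record (DF/4CB96000), and the page found there carries an address from DB/62 at the same in-segment offset. The alloc request size from pg_log, 1656315904, is 0x62B96000, which equals the low 32 bits of that address. I don't know whether that last match is meaningful or a coincidence.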

One thing I noticed is that the network is not the most stable. When I ran a Wireshark capture on port 5432, I saw numerous errors and warnings such as:
"New fragment overlaps old data (retransmission?)"
"This frame is a (suspected) out-of-order segment"
"This frame is a (suspected) retransmission"

So the questions are: why did the standby server crash, and could the network instability be the cause?

Thank you in advance for any info.
Seong
