Re: postgres on physical replica crashes

From: SRINIVASARAO OGURI <srinioraclepostgres(at)gmail(dot)com>
To: Hannes Erven <hannes(at)erven(dot)at>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, greigwise(at)comcast(dot)net
Subject: Re: postgres on physical replica crashes
Date: 2018-05-07 10:31:07
Message-ID: CAO69RdgCxp5Yq6KaopcdgLCHkh28eTKtJAk5TCW6x5Q1c_i9pw@mail.gmail.com
Lists: pgsql-general

Hi Greig Wise,

If you are using CentOS/Red Hat 7, check this link:
https://srinivasoguri.blogspot.in/2018/04/postgresql-crash-in-centosredhat-07.html

On Fri, Apr 20, 2018 at 6:58 PM, Hannes Erven <hannes(at)erven(dot)at> wrote:

> Hi Greig,
>
>
> just last week I experienced the same situation as you on a 10.3 physical
> replica (it even has checksums enabled), and a few months ago on 9.6.
> We used the same resolution as you did, and so far we haven't noticed any
> problems with data integrity on the replicas.
>
>
>
> The logs were as follows:
> 2018-04-13 06:31:16.947 CEST [15603] FATAL: terminating walreceiver due
> to timeout
> 2018-04-13 06:31:16.948 CEST [15213] FATAL: invalid memory alloc request
> size 4280303616
> 2018-04-13 06:31:16.959 CEST [15212] LOG: startup process (PID 15213)
> exited with exit code 1
> 2018-04-13 06:31:16.959 CEST [15212] LOG: terminating any other active
> server processes
> 2018-04-13 06:31:16.959 CEST [19838] user(at)db WARNING: terminating
> connection because of crash of another server process
> 2018-04-13 06:31:16.959 CEST [19838] user(at)db DETAIL: The postmaster has
> commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-13 06:31:16.959 CEST [19838] user(at)db HINT: In a moment you
> should be able to reconnect to the database and repeat your command.
>
>
> This replica then refused to start up:
> 2018-04-13 09:25:15.941 CEST [1957] LOG: entering standby mode
> 2018-04-13 09:25:15.947 CEST [1957] LOG: redo starts at 1C/69C0FF30
> 2018-04-13 09:25:15.951 CEST [1957] LOG: consistent recovery state
> reached at 1C/69D9A9C0
> 2018-04-13 09:25:15.952 CEST [1956] LOG: database system is ready to
> accept read only connections
> 2018-04-13 09:25:15.953 CEST [1957] FATAL: invalid memory alloc request
> size 4280303616
> 2018-04-13 09:25:15.954 CEST [1956] LOG: startup process (PID 1957) exited
> with exit code 1
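
A small arithmetic check on the "invalid memory alloc request size 4280303616" message above may help. The sketch below only decodes the number; MAX_ALLOC_SIZE mirrors the MaxAllocSize constant (0x3fffffff) from the PostgreSQL source, and the function name is made up for this illustration. That the oversized value comes from a bogus record length read out of a damaged WAL file is one plausible reading, not something the logs prove.

```python
# MaxAllocSize as defined in PostgreSQL's memutils.h: 1 GiB minus one byte.
MAX_ALLOC_SIZE = 0x3FFFFFFF


def describe_alloc_request(size: int) -> dict:
    """Summarise an allocation size taken from the FATAL message (hypothetical helper)."""
    return {
        "hex": f"0x{size:08X}",
        "gib": round(size / 2**30, 2),
        # Ordinary palloc requests above MaxAllocSize are rejected with this FATAL.
        "exceeds_limit": size > MAX_ALLOC_SIZE,
        # True when the value still fits in an unsigned 32-bit field.
        "fits_in_uint32": size <= 0xFFFFFFFF,
    }


info = describe_alloc_request(4280303616)
print(info)
```

The value is just under 4 GiB and fits in 32 bits, so it looks more like garbage in a 32-bit length field than like a genuine request for memory.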
>
>
> ... until the WAL files from the hot standby's pg_wal were manually
> removed and re-downloaded from the primary.
>
> Unfortunately I did not collect hard evidence, but I think I saw the
> primary's replication slot's restart point was set to a position /after/
> the standby's actual restart location. This time, the error was noticed
> immediately and the required WAL was still present on the master.
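
To check the suspicion above (slot restart point after the standby's restart location) with hard numbers next time, the two LSNs can be compared numerically. This is a minimal sketch: the function names are hypothetical, the 16 MB segment size and timeline 1 are assumptions (they are the defaults, but not confirmed for this cluster), and the slot value would have to come from restart_lsn in pg_replication_slots on the primary.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '1C/69C0FF30' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(lsn: int, timeline: int = 1,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segment_number = lsn // segment_size
    segments_per_id = 0x100000000 // segment_size
    return (f"{timeline:08X}"
            f"{segment_number // segments_per_id:08X}"
            f"{segment_number % segments_per_id:08X}")


def lsn_is_after(candidate: str, reference: str) -> bool:
    """True if `candidate` lies strictly after `reference` in the WAL stream."""
    return parse_lsn(candidate) > parse_lsn(reference)


# The standby's redo start from the log above; timeline 1 is assumed.
print(wal_file_name(parse_lsn("1C/69C0FF30")))
```

If lsn_is_after(slot_restart_lsn, standby_redo_lsn) is true, the primary is free to recycle segments the standby still needs, which would match the missing segments described for the 9.6 cluster.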
>
>
> A few months ago I experienced the same situation on a 9.6 cluster, but
> that was not noticed for a long time, and - despite using a replication
> slot! - the primary had already removed required segments. Fortunately I
> could get them from a tape backup...
>
>
>
> Best regards,
>
> -hannes
>
>
>
>
> On 2018-04-18 at 18:16, greigwise wrote:
>
>> Hello. I've had several instances where postgres on my physical replica
>> under version 9.6.6 is crashing with messages like the following in the
>> logs:
>>
>> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL: The postmaster
>> has commanded this server process to roll back the current transaction and
>> exit, because another server process exited abnormally and possibly
>> corrupted shared memory.
>> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT: In a moment you
>> should be able to reconnect to the database and repeat your command.
>> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING: terminating
>> connection because of crash of another server process
>> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL: The postmaster
>> has commanded this server process to roll back the current transaction and
>> exit, because another server process exited abnormally and possibly
>> corrupted shared memory.
>> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT: In a moment you
>> should be able to reconnect to the database and repeat your command.
>> 2018-04-18 05:43:27 UTC 5acf5e12.6819 LOG: database system is shut down
>>
>> When this happens, what I've found is that I can go into the pg_xlog
>> directory on the replica and remove all the log files; postgres will then
>> restart and things seem to come back up normally.
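
A more cautious variant of the workaround described above is to move the segment files aside instead of deleting them, so they can still be inspected afterwards. The sketch below is only an illustration: the function name and both directory arguments are hypothetical, it presumes the standby is stopped while it runs, and it only touches files whose names look like WAL segments (24 uppercase hex characters), leaving .history files and subdirectories such as archive_status alone. The WAL directory is pg_xlog up to 9.6 and pg_wal from version 10 on.

```python
import re
import shutil
from pathlib import Path

# WAL segment files are named with exactly 24 uppercase hex characters.
WAL_SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def move_wal_segments_aside(wal_dir, backup_dir) -> list:
    """Move WAL segment files from wal_dir into backup_dir; return their names."""
    wal_dir = Path(wal_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for entry in sorted(wal_dir.iterdir()):
        # Skip directories and anything that is not a plain segment file.
        if entry.is_file() and WAL_SEGMENT_NAME.match(entry.name):
            shutil.move(str(entry), str(backup_dir / entry.name))
            moved.append(entry.name)
    return moved
```

Keeping the moved files around also makes it possible to compare them later with the copies fetched again from the primary, which might help answer the question of whether they were damaged in transmission.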
>>
>> So, the question is what's going on here... is the log maybe getting
>> corrupted in transmission somehow? Should I be concerned about the
>> viability of my replica after having restarted in the described fashion?
>>
>> Thanks,
>> Greig Wise
>>
>>
>>
>> --
>> Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
>>
>>
>
>
