Re: postgres on physical replica crashes

From: Hannes Erven <hannes(at)erven(dot)at>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Cc: greigwise(at)comcast(dot)net
Subject: Re: postgres on physical replica crashes
Date: 2018-04-20 13:28:06
Message-ID: c433628d-35ec-3273-34fc-ce9a171b400b@erven.at
Lists: pgsql-general

Hi Greig,

Just last week I experienced the same situation as you on a 10.3
physical replica (it even has checksums activated), and a few months ago
on 9.6.
We used the same resolution as you did, and so far we haven't noticed any
problems with data integrity on the replicas.

The logs were as follows:
2018-04-13 06:31:16.947 CEST [15603] FATAL: terminating walreceiver
due to timeout
2018-04-13 06:31:16.948 CEST [15213] FATAL: invalid memory alloc
request size 4280303616
2018-04-13 06:31:16.959 CEST [15212] LOG: startup process (PID 15213)
exited with exit code 1
2018-04-13 06:31:16.959 CEST [15212] LOG: terminating any other active
server processes
2018-04-13 06:31:16.959 CEST [19838] user(at)db WARNING: terminating
connection because of crash of another server process
2018-04-13 06:31:16.959 CEST [19838] user(at)db DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-04-13 06:31:16.959 CEST [19838] user(at)db HINT: In a moment you
should be able to reconnect to the database and repeat your command.

This replica then refused to start up:
2018-04-13 09:25:15.941 CEST [1957] LOG: entering standby mode
2018-04-13 09:25:15.947 CEST [1957] LOG: redo starts at 1C/69C0FF30
2018-04-13 09:25:15.951 CEST [1957] LOG: consistent recovery state
reached at 1C/69D9A9C0
2018-04-13 09:25:15.952 CEST [1956] LOG: database system is ready to
accept read only connections
2018-04-13 09:25:15.953 CEST [1957] FATAL: invalid memory alloc request
size 4280303616
2018-04-13 09:25:15.954 CEST [1956] LOG: startup process (PID 1957)
exited with exit code 1

... until the WAL files from the hot standby's pg_wal were manually
removed and re-downloaded from the primary.
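
As a side note on the FATAL line above: PostgreSQL's regular memory
allocator refuses any single request larger than MaxAllocSize
(0x3fffffff, i.e. 1 GB - 1), and 4280303616 is far beyond that. The
sketch below only does this arithmetic; the helper name is made up for
illustration. Whether the oversized value came from a damaged record
length in the WAL is my guess, not something the logs prove.

```python
# PostgreSQL's MaxAllocSize: the largest size a normal palloc() accepts.
MAX_ALLOC_SIZE = 0x3FFFFFFF  # 1 GB - 1


def exceeds_max_alloc(size: int) -> bool:
    """Return True if a request of `size` bytes would be rejected.

    Hypothetical helper for illustration only, not a PostgreSQL API.
    """
    return size > MAX_ALLOC_SIZE


requested = 4280303616  # value taken from the FATAL log line
print(hex(requested))                # 0xff204000
print(exceeds_max_alloc(requested))  # True
```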

Unfortunately I did not collect hard evidence, but I think I saw that the
restart point of the primary's replication slot was set to a position
/after/ the standby's actual restart location. This time, the error was
noticed immediately and the required WAL was still present on the primary.
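
For anyone who wants to check this the next time it happens, here is a
minimal sketch of the comparison I have in mind. It assumes you have
copied the slot's restart_lsn from pg_replication_slots on the primary
and the "redo starts at" position from the standby's log. The function
names and the slot value are hypothetical; only the redo position comes
from the log above.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '1C/69C0FF30' to a 64-bit integer.

    The part before the slash is the high 32 bits, the part after it
    the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_is_ahead(slot_restart_lsn: str, standby_redo_lsn: str) -> bool:
    """Return True if the slot's restart point lies after the position
    from which the standby still needs to replay WAL."""
    return parse_lsn(slot_restart_lsn) > parse_lsn(standby_redo_lsn)


standby_redo = "1C/69C0FF30"  # "redo starts at" from the standby log
slot_restart = "1C/6A000000"  # hypothetical value, NOT from my system

print(slot_is_ahead(slot_restart, standby_redo))  # True
```

If this prints True, the primary is free to recycle segments the
standby still needs, which would match what I saw on 9.6.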

A few months ago I experienced the same situation on a 9.6 cluster, but
that was not noticed for a long time, and - despite using a replication
slot! - the primary had already removed required segments. Fortunately I
could get them from a tape backup...

Best regards,

-hannes

Am 2018-04-18 um 18:16 schrieb greigwise:
> Hello. I've had several instances where postgres on my physical replica
> under version 9.6.6 is crashing with messages like the following in the
> logs:
>
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser DETAIL: The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e4a.6918 dbuser HINT: In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser WARNING: terminating
> connection because of crash of another server process
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser DETAIL: The postmaster
> has commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2018-04-18 05:43:26 UTC dbname 5acf5e39.68e5 dbuser HINT: In a moment you
> should be able to reconnect to the database and repeat your command.
> 2018-04-18 05:43:27 UTC 5acf5e12.6819 LOG: database system is shut down
>
> When this happens, what I've found is that I can go into the pg_xlog
> directory on the replica, remove all the log files and the postgres will
> restart and things seem to come back up normally.
>
> So, the question is what's going on here... is the log maybe getting corrupt
> in transmission somehow? Should I be concerned about the viability of my
> replica after having restarted in the described fashion?
>
> Thanks,
> Greig Wise
