BUG? Slave don't reconnect to the master

From:	Олег Самойлов <splarv(at)ya(dot)ru>
To:	pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject:	BUG? Slave don't reconnect to the master
Date:	2020-08-18 10:48:41
Message-ID:	60590EC6-4062-4F25-A49C-3948ED2A7D47@ya.ru
Lists:	pgsql-general

Hi all.

I found some strange behaviour in PostgreSQL that I believe is a bug. First, let me explain the situation.

I created a test bed for testing high-availability clusters based on Pacemaker and PostgreSQL. The test bed consists of 12 virtual machines (on VirtualBox) running on a MacBook Pro, which form 4 HA clusters with different structures. All 4 HA clusters are tested constantly in a loop: a failure of some kind is simulated, the test waits for failover, the failure is fixed, and so on. For simplicity I'll describe only one HA cluster. It consists of 3 virtual machines, with the master on one and a sync slave and an async slave on the others. The PostgreSQL service is provided through floating IPs that point to the working master and slaves. The slaves connect to the master through its floating IP too. When Pacemaker detects a failure, for instance on the master, it promotes the node with the least WAL lag to master and switches the floating IPs, so the third node continues as a sync slave. My company has decided to release this project as open source, and I am now finishing the formalities.
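The promotion rule described above (pick the surviving node whose WAL is least behind) can be sketched roughly as follows. This is only an illustration, not the actual resource agent logic: it assumes "least WAL lag" means "highest last known LSN", and the function names and node names are hypothetical. The LSN text format (two hexadecimal halves separated by a slash) is the same one that appears in the log lines in this message.

```python
from typing import Dict


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/1600DDE8' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(last_lsn_by_node: Dict[str, str]) -> str:
    """Return the node whose last known WAL position is the highest.

    Assumption: the node with the highest LSN is the one with the least lag.
    """
    if not last_lsn_by_node:
        raise ValueError("no surviving nodes to choose from")
    return max(last_lsn_by_node, key=lambda node: parse_lsn(last_lsn_by_node[node]))


if __name__ == "__main__":
    # Hypothetical node names and positions, for illustration only.
    survivors = {"node2": "0/1A00DE10", "node3": "0/1600DDE8"}
    print(pick_promotion_candidate(survivors))
```

Comparing LSNs as integers rather than as strings matters, because the two hexadecimal halves are not zero-padded.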

It works fine almost always, but occasionally, rather rarely, I see that a slave doesn't reconnect to the new master after a failure. The first case is the PostgreSQL-STOP test, in which I use `kill` to send the STOP signal to postgres on the master to simulate a freeze. The slave doesn't reconnect to the new master, and its log shows these errors:

18:02:56.236 [3154] FATAL: terminating walreceiver due to timeout
18:02:56.237 [1421] LOG: record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10
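For readers who want to reproduce the PostgreSQL-STOP test, the freeze can be sketched as below. This is a minimal sketch under assumptions: the helper names are hypothetical, and the message does not say exactly which postgres processes receive the signal, so the caller has to supply the PID(s).

```python
import os
import signal


def freeze(pid: int) -> None:
    """Suspend a process with SIGSTOP.

    The process stays alive and keeps its sockets open, but stops responding,
    which is what makes it look frozen to its peers.
    """
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously frozen process continue with SIGCONT."""
    os.kill(pid, signal.SIGCONT)
```

SIGSTOP cannot be caught or ignored by the target process, so from the slave's point of view the master simply goes silent without closing the connection.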

What is strange is that the error about the incorrect WAL record is raised after the connection has been terminated. This can be worked around by turning off the WAL receiver timeout (wal_receiver_timeout). With that change PostgreSQL-STOP works fine, but the problem still exists in another test. ForkBomb simulates an out-of-memory situation. In this case too, a slave sometimes doesn't reconnect to the new master, with these errors in the log:

10:09:43.99 [1417] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
10:09:43.992 [1413] LOG: invalid record length at 0/D8014278: wanted 24, got 0

By the way, the last error message (the last line in the log) varied between occurrences.

What I expect as the right behaviour: the PostgreSQL slave must reconnect to the master IP (the floating IP) after wal_retrieve_retry_interval.
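The expected behaviour can be pictured as a simple retry loop. The sketch below only illustrates the expectation stated above and is not PostgreSQL's actual implementation: `stream_with_retry`, `connect_and_stream`, and the other parameter names are hypothetical, and `retry_interval_s` merely stands in for wal_retrieve_retry_interval.

```python
import time
from typing import Callable


def stream_with_retry(
    connect_and_stream: Callable[[], None],
    retry_interval_s: float = 5.0,  # stand-in for wal_retrieve_retry_interval
    max_attempts: int = 0,          # 0 means retry forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect_and_stream until it returns normally.

    Returns the number of attempts that were needed. A ConnectionError from
    the callable is treated as a lost or refused connection and triggers a
    wait of retry_interval_s before the next attempt.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            connect_and_stream()
            return attempt
        except ConnectionError:
            if max_attempts and attempt >= max_attempts:
                raise
            sleep(retry_interval_s)
```

In this picture the target address never changes: each attempt goes to the same floating IP, and it is Pacemaker's job to move that IP to the new master between attempts.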

Responses

Re: BUG? Slave don't reconnect to the master at 2020-08-19 13:07:13 from Jehan-Guillaume de Rorthais
