Fwd: postgresql lost connection to repmgr arbitrarily

From: Zhaoxun Yan <yan(dot)zhaoxun(at)gmail(dot)com>
To: Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Fwd: postgresql lost connection to repmgr arbitrarily
Date: 2023-10-18 09:41:01
Message-ID: CADEX6_WZmRx-7HmkCgaN_RHriRakYtPMo7USc=FjNimeRWED3w@mail.gmail.com
Lists: pgsql-admin

I found the postgresql process with this line:
postgres: rep repmgr 172.17.1.2(60490) idle

It is the TCP connection from the local address 172.17.1.2:60490, and it is
labeled "idle".
I checked a local connection to 172.17.1.2, which is the address of eth0.
It is a loopback connection, just like localhost:
[root@yzx2 ~]# tracepath 172.17.1.2
1: yzx2 0.057ms reached
Resume: pmtu 65535 hops 1 back 1
Thus no router is involved in the repmgr-postgresql connection; otherwise
the MTU would be <= 1500.
Does the "idle" label mean something?
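
As a reading aid for the process line quoted above, here is a minimal Python sketch that splits it into fields. It assumes the usual layout of a client backend title, "postgres: <user> <database> <host>(<port>) <activity>"; the function name parse_backend_title is hypothetical and used only for illustration. In that layout the trailing word is the backend's current activity, and "idle" normally means the backend is waiting for the next command from its client.

```python
import re

# Assumed layout: "postgres: <user> <database> <host>(<port>) <activity>"
TITLE_RE = re.compile(
    r"^postgres:\s+(?P<user>\S+)\s+(?P<database>\S+)\s+"
    r"(?P<host>[^\s()]+)\((?P<port>\d+)\)\s+(?P<activity>.+)$"
)


def parse_backend_title(title):
    """Split a backend process title into its fields, or return None."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])  # client-side TCP port
    return fields


info = parse_backend_title("postgres: rep repmgr 172.17.1.2(60490) idle")
print(info)
```

Here "rep" is the role, "repmgr" the database, and 172.17.1.2 with port 60490 the client end of the connection.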

Forwarded Conversation
Subject: postgresql lost connection to repmgr arbitrarily
------------------------

From: Zhaoxun Yan <yan(dot)zhaoxun(at)gmail(dot)com>
Date: Tue, Oct 17, 2023 at 2:55 PM
To: Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>

Hi!
It happens from time to time. At first I thought it was a router problem,
so I changed the host in repmgr's configuration from its intranet address
to 127.0.0.1, but the problem persists. Here is what happened according to repmgrd:

2023-10-17 00:21:50+0800: repmgrd_local_disconnect on node2, unable to
connect to local node - happened
2023-10-17 00:21:50.471347+08: repmgrd_local_reconnect on node2,
reconnected to local node after 0 seconds - happened
2023-10-17 10:23:05+0800: repmgrd_local_disconnect on node2, unable to
connect to local node - happened
2023-10-17 10:23:05.473997+08: repmgrd_local_reconnect on node2,
reconnected to local node after 0 seconds - happened
2023-10-17 13:22:23+0800: repmgrd_local_disconnect on node2, unable to
connect to local node - happened
2023-10-17 13:22:23.552278+08: repmgrd_local_reconnect on node2,
reconnected to local node after 0 seconds - happened

So I checked the PostgreSQL side, and its log reports:
2023-10-17 00:21:50.415 CST [2210264] LOG: could not receive data from
client: Connection reset by peer
2023-10-17 10:23:05.420 CST [2249257] LOG: could not receive data from
client: Connection reset by peer
2023-10-17 13:22:23.486 CST [2260546] LOG: could not receive data from
client: Connection reset by peer
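
The two logs can be matched up by timestamp. Below is a minimal Python sketch that pairs each repmgrd_local_disconnect event with the PostgreSQL log line from the same second and reports the backend PID. It is only an illustration: the log lines are the ones quoted above with the mail line-wrapping undone, both logs are assumed to use the same +08 timezone, and the names leading_timestamp and correlate are hypothetical.

```python
import re
from datetime import datetime, timedelta

REPMGR_EVENTS = [
    "2023-10-17 00:21:50+0800: repmgrd_local_disconnect on node2, unable to connect to local node - happened",
    "2023-10-17 10:23:05+0800: repmgrd_local_disconnect on node2, unable to connect to local node - happened",
    "2023-10-17 13:22:23+0800: repmgrd_local_disconnect on node2, unable to connect to local node - happened",
]
PG_LOG = [
    "2023-10-17 00:21:50.415 CST [2210264] LOG: could not receive data from client: Connection reset by peer",
    "2023-10-17 10:23:05.420 CST [2249257] LOG: could not receive data from client: Connection reset by peer",
    "2023-10-17 13:22:23.486 CST [2260546] LOG: could not receive data from client: Connection reset by peer",
]

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
PID_RE = re.compile(r"\[(\d+)\]")


def leading_timestamp(line):
    """Parse the leading timestamp, truncated to whole seconds."""
    match = TS_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


def correlate(events, pg_lines, window=timedelta(seconds=1)):
    """Pair disconnect events with PostgreSQL log lines close in time."""
    pairs = []
    for event in events:
        if "repmgrd_local_disconnect" not in event:
            continue
        event_ts = leading_timestamp(event)
        for line in pg_lines:
            pg_ts = leading_timestamp(line)
            pid = PID_RE.search(line)
            if event_ts and pg_ts and pid and abs(event_ts - pg_ts) <= window:
                pairs.append((event_ts, int(pid.group(1))))
    return pairs


pairs = correlate(REPMGR_EVENTS, PG_LOG)
for ts, pid in pairs:
    print(ts, "backend PID", pid)
```

With the lines above, each of the three disconnect events matches exactly one "Connection reset by peer" entry, each from a different backend PID.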

I have set up the keepalive feature in postgresql.conf to prevent a router
from cutting off the TCP connection:

tcp_keepalives_idle = 20
tcp_keepalives_interval = 10
tcp_keepalives_count = 3
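
For reference, the arithmetic behind these three settings can be sketched in Python. This assumes the usual TCP keepalive behaviour: the first probe goes out after tcp_keepalives_idle seconds of silence, further probes follow every tcp_keepalives_interval seconds, and the connection is considered dead once tcp_keepalives_count probes go unanswered. The function name is hypothetical and the result is an approximation.

```python
def keepalive_detection_seconds(idle, interval, count):
    """Approximate seconds of silence before a dead peer is detected."""
    return idle + interval * count


# Values from the postgresql.conf settings quoted above.
detection = keepalive_detection_seconds(idle=20, interval=10, count=3)
print(detection)  # 50
```

So with these values an unresponsive peer would be noticed after roughly 50 seconds, and an idle connection would carry a probe about every 20 seconds.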

Do you have any idea what went wrong? BTW, the PostgreSQL version is 15.4
and the repmgr version is 5.4dev.

----------
From: Scott Ribe <scott_ribe(at)elevated-dev(dot)com>
Date: Tue, Oct 17, 2023 at 9:39 PM
To: Zhaoxun Yan <yan(dot)zhaoxun(at)gmail(dot)com>
Cc: Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>

I have no idea if this is related to your problem, but...

I once had a connection timeout where a big institution was using Cisco
routers, which charged ongoing license fees, tiered by how many connections
they would support. And they configured them to recognize keepalive
packets, and drop connections which only had keepalive packets for some
length of time!

----------
From: Zhaoxun Yan <yan(dot)zhaoxun(at)gmail(dot)com>
Date: Wed, Oct 18, 2023 at 10:25 AM
To: Scott Ribe <scott_ribe(at)elevated-dev(dot)com>
Cc: Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>

Hi Scott,
To avoid the problem you mentioned, I have already changed the host
address to 127.0.0.1, meaning 'localhost', so the connection stays on
that machine and does not go through a router.
