From: | Rui DeSousa <rui(at)crazybean(dot)net> |
---|---|
To: | Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> |
Cc: | pgsql-admin(at)lists(dot)postgresql(dot)org |
Subject: | Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" |
Date: | 2018-11-15 14:48:09 |
Message-ID: | 52EE6BE4-73B4-4BDB-9CDF-1F1C099BA0AD@crazybean.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
> On Nov 15, 2018, at 2:20 AM, Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> wrote:
>
> On 14/11/18 6:24 μ.μ., Rui DeSousa wrote:
>>
>>> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> wrote:
>>>
>>> Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself earlier). Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the primary. We have :
>>>
>>> # sysctl -a | grep -i keepaliv
>>> net.ipv4.tcp_keepalive_intvl = 75
>>> net.ipv4.tcp_keepalive_probes = 9
>>> net.ipv4.tcp_keepalive_time = 7200
>>>
>> Those keep alive settings are linux’s defaults and work out to be 18 hours before the abandon connection is dropped. So, the WAL receiver should have corrected itself after that time. For reference, I run terminating abandon session within 15 mins as they take-up valuable database resources and could potentially hold on to locks, snapshots, etc.
>
> Didn't happen. Friday 15:01 + '18 hrs' = Saturday 09:01 . The subscription was stuck on Friday 15:01, unfortunately I didn't keep the whole output of the pg_stat_subscription (which I took on Sunday morning). And in Sunday morning the primary, having run out of space, PANIC'ed .
Apparently TCP keep alive is not enabled; looks like we’re missing the following but don’t look for it :). So the connection would stick around until it gets a reset packet which could be days if not longer especially if packets are being dropped. I know you say there is no firewall; but the upstream server would send a reset packet if connection was terminated and/or Postgres was down — so there is something dropping packets.
net.inet.tcp.always_keepalive=1
A quick google and it looks like Linux defaults to not enabling keep alive whereas FreeBSD enables it by default and globally regardless of application request. For Linux, Postgres will need to request it. You will need to setup the keep alive parameters in the Postgres configuration and restart the server.
The keep alive setup will allow WAL receiver to detect the broken connection resulting in it terminating the current connection and attempt to establish a new connection.
From | Date | Subject | |
---|---|---|---|
Next Message | pgsql-admin | 2018-11-15 18:18:02 | The current shape of PG master-slave replication |
Previous Message | Achilleas Mantzios | 2018-11-15 07:20:54 | Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device" |