Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From: Rui DeSousa <rui(at)crazybean(dot)net>
To: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Cc: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Date: 2018-11-15 14:48:09
Message-ID: 52EE6BE4-73B4-4BDB-9CDF-1F1C099BA0AD@crazybean.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

> On Nov 15, 2018, at 2:20 AM, Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> wrote:
>
> On 14/11/18 6:24 μ.μ., Rui DeSousa wrote:
>>
>>> On Nov 14, 2018, at 3:31 AM, Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> wrote:
>>>
>>> Our sysadms (seasoned linux/network guys : we have been working here for more than 10 yrs) were absolute in that we run no firewall or other traffic shaping system between the two hosts. (if we did the problem would manifest itself earlier). Can you recommend what to look for exactly regarding both TCP stacks ? The subscriber node is a clone of the primary. We have :
>>>
>>> # sysctl -a | grep -i keepaliv
>>> net.ipv4.tcp_keepalive_intvl = 75
>>> net.ipv4.tcp_keepalive_probes = 9
>>> net.ipv4.tcp_keepalive_time = 7200
>>>
>> Those keep alive settings are linux’s defaults and work out to be 18 hours before the abandon connection is dropped. So, the WAL receiver should have corrected itself after that time. For reference, I run terminating abandon session within 15 mins as they take-up valuable database resources and could potentially hold on to locks, snapshots, etc.
>
> Didn't happen. Friday 15:01 + '18 hrs' = Saturday 09:01 . The subscription was stuck on Friday 15:01, unfortunately I didn't keep the whole output of the pg_stat_subscription (which I took on Sunday morning). And in Sunday morning the primary, having run out of space, PANIC'ed .

Apparently TCP keep alive is not enabled; looks like we’re missing the following but don’t look for it :). So the connection would stick around until it gets a reset packet which could be days if not longer especially if packets are being dropped. I know you say there is no firewall; but the upstream server would send a reset packet if connection was terminated and/or Postgres was down — so there is something dropping packets.

net.inet.tcp.always_keepalive=1

A quick google and it looks like Linux defaults to not enabling keep alive whereas FreeBSD enables it by default and globally regardless of application request. For Linux, Postgres will need to request it. You will need to setup the keep alive parameters in the Postgres configuration and restart the server.

The keep alive setup will allow WAL receiver to detect the broken connection resulting in it terminating the current connection and attempt to establish a new connection.

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message pgsql-admin 2018-11-15 18:18:02 The current shape of PG master-slave replication
Previous Message Achilleas Mantzios 2018-11-15 07:20:54 Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"