Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"
Date: 2018-11-17 11:07:18
Message-ID: 356b8aba-6ce1-567c-e0fb-9660bdfc7ebe@matrix.gatewaynet.com
Lists: pgsql-admin


On 16/11/18 5:29 PM, Rui DeSousa wrote:
>
>
>> On Nov 16, 2018, at 3:18 AM, Achilleas Mantzios
>> <achill(at)matrix(dot)gatewaynet(dot)com> wrote:
>>
>>> net.inet.tcp.always_keepalive=1
>>
>> This setting is FreeBSD-specific. I tested changing it on my
>> PostgreSQL 11.1 on FreeBSD 11.2-RELEASE-p3, and it had no effect at
>> all on the PostgreSQL settings; all three of them remained at zero.
>> This is completely irrelevant to my problem, but anyway.
>>
>
> That is what I stated; you don’t need it.  The point is that on Linux
> the application has to enable it, and I don’t know of a kernel setting
> for Linux like the one in FreeBSD.

You may read the PostgreSQL backend sources (grep for SO_KEEPALIVE); the
code supports keepalive.
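
To illustrate what "the application has to enable it" means at the
socket level, here is a minimal sketch in Python. It is not taken from
the PostgreSQL sources; the helper name enable_keepalive and the values
60/10/6 are my own assumptions, chosen only for illustration. The
TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are the per-socket
overrides available on Linux, so they are applied only where the
platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Enable TCP keepalive on one socket (hypothetical helper).

    idle/interval/count are illustrative values, not PostgreSQL defaults.
    """
    # This is the per-socket switch the application itself must flip.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional per-socket overrides of the kernel tunables. Not every
    # platform exposes all three, so each one is applied only if present.
    overrides = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Without the SO_KEEPALIVE call, the kernel tunables have no effect on
that socket, whatever their values.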

>
>>>
>>> A quick google and it looks like Linux defaults to not enabling keep
>>> alive whereas FreeBSD enables it by default and globally regardless
>>> of application request.  For Linux, Postgres will need to request
>>> it. You will need to set up the keep alive parameters in the Postgres
>>> configuration and restart the server.
>>
>> http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
>> So according to the official Linux docs, three parameters govern
>> TCP keepalive in Linux, and on both of these systems they are set
>> as follows:
>> root(at)TEST-smadb:/var/lib/pgsql# sysctl -a | grep keep
>> net.ipv4.tcp_keepalive_intvl = 75
>> net.ipv4.tcp_keepalive_probes = 9
>> net.ipv4.tcp_keepalive_time = 7200
>> root(at)TEST-smadb:/var/lib/pgsql#
>>
>
> That does not mean the connection has TCP keep alive enabled; it just
> means that if an application requests it, those would be the default
> settings if it doesn’t provide its own.  Those settings would be too
> large anyway; you want to be able to detect a broken connection much
> quicker than the two-plus hours those defaults imply.

I checked on a bare-minimum default installation (after lowering the
kernel tunables, of course): keepalive messages are sent and ACKed at
the specified intervals, as verified with Wireshark on port 5432. You
should test this yourself.
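
For reference, the time the kernel needs to give up on a dead peer can
be estimated from the three tunables as roughly tcp_keepalive_time +
tcp_keepalive_probes * tcp_keepalive_intvl. A small sketch of that
arithmetic, using the values from the sysctl output quoted above (the
function name is my own, and the formula is an approximation, not an
exact kernel guarantee):

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Approximate seconds until an idle, dead connection is dropped.

    The connection must first stay idle for time_s; then `probes`
    unanswered probes are sent, intvl_s seconds apart.
    """
    return time_s + probes * intvl_s


# Values from the sysctl output quoted above.
default_s = keepalive_detection_seconds(time_s=7200, intvl_s=75, probes=9)
hours, remainder = divmod(default_s, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{default_s} s = {hours} h {minutes} min {seconds} s")
# prints "7875 s = 2 h 11 min 15 s"
```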

>
>>>
>>> The keep alive setup will allow WAL receiver to detect the broken
>>> connection resulting in it terminating the current connection and
>>> attempt to establish a new connection.
>>
>> So from the looks of this, keep alive is enabled. (Also, don't
>> confuse the WAL receiver with the logical worker; they are different
>> programs, albeit similar.)
>
> I don’t believe it’s enabled; have you checked that you are getting
> keep alive packets?  If it were enabled, the connection would have
> been terminated after a little over two hours.

See above. In the meantime, it would be nice if someone from the hackers
list would chime in to clear things up, just to be sure.

Which means that, since PostgreSQL *supports* KEEPALIVE and the logical
worker carried on as if nothing had happened, I guess *something* was
mocking the KEEPALIVE ACKs?

>
>> Is there any way (by network means?) to mock this behavior in order
>> to fool the replication worker into believing the sender is still
>> there?
>
> Put a firewall in-between the servers and drop the packets without
> sending resets.
>
>
> Have a read here:
>
> Section 4.2
>
> http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/
>
> The RFC states TCP keep alive should be off by default; FreeBSD
> changed that back in 1999 and I believe Linux still follows the RFC:
>
> https://serverfault.com/questions/671710/why-does-freebsd-net-inet-tcp-always-keepalive-violate-rfc1122#671749
>
>
