Re: Postgresql 11: terminating walsender process due to replication timeout

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: abhishek(dot)bhola(at)japannext(dot)co(dot)jp
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Postgresql 11: terminating walsender process due to replication timeout
Date: 2021-09-09 06:56:35
Message-ID: 20210909.155635.1680635236525675012.horikyota.ntt@gmail.com
Lists: pgsql-general

At Thu, 9 Sep 2021 14:52:25 +0900, Abhishek Bhola <abhishek(dot)bhola(at)japannext(dot)co(dot)jp> wrote in
> I have found some questions about the same error, but didn't find any of
> them answering my problem.
>
> The setup is that I have two Postgres11 clusters (A and B) and they are
> making use of publication and subscription features to copy data from A to
> B.
>
> A (source DB- publication) --------------> B (target DB - subscription)
>
> This works fine, but often (not always) when the data volume being inserted
> on a table in node A increases, it gives the following error.
>
> "terminating walsender process due to replication timeout"
>
> The data volume at the moment being entered is about 30K rows per second
> continuously for hours through COPY command.
>
> Earlier the wal_sender_timeout was set to 5 sec and I would see this error
> much often. I then increased it to 1 min and the frequency of this error
> reduced. But I don't want to keep increasing it without understanding what
> is causing it. I looked at the code of walsender.c and know the exact lines
> where it's coming from.
>
> But I am still not clear which parameter is making the sender assume that
> the receiver node is inactive and therefore it should stop the wal_sender.
>
> Can anyone please suggest what changes I should make to remove this error?

Which minor version of PostgreSQL 11 are the servers running? PostgreSQL 11
received the following fix in 11.6, which could be related to the
trouble.

https://www.postgresql.org/docs/11/release-11-6.html

> Fix timeout handling in logical replication walreceiver processes
> (Julien Rouhaud)
>
> Erroneous logic prevented wal_receiver_timeout from working in
> logical replication deployments.
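
To tell whether a given 11.x server predates that release, its reported version can be compared against 11.6. The following is only a minimal sketch in Python, assuming the version string looks like the output of `SHOW server_version` (for example "11.5", possibly followed by a packaging suffix). The names `parse_version` and `has_timeout_fix` are hypothetical helpers for illustration, not part of PostgreSQL.

```python
import re

# Minor release that the release notes quoted above name as carrying the fix.
FIX_VERSION = (11, 6)


def parse_version(text):
    """Extract (major, minor) from a string such as '11.5' or '11.5 (Debian ...)'."""
    match = re.match(r"\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def has_timeout_fix(text):
    """Return True if a PostgreSQL 11.x version string is at or above 11.6."""
    major, minor = parse_version(text)
    if major != 11:
        # Assumption: this sketch only reasons about the 11.x branch.
        raise ValueError("this check only covers PostgreSQL 11.x")
    return (major, minor) >= FIX_VERSION
```

For example, `has_timeout_fix("11.5")` returns False and `has_timeout_fix("11.6")` returns True. Since the fix concerns the logical replication worker, the subscriber side (B) is probably the more interesting one to check.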

The details of the fix are here.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f60f690fac1bf375b92cf2f8682e8fe8f69098
> Fix timeout handling in logical replication worker
>
> The timestamp tracking the last moment a message is received in a
> logical replication worker was initialized in each loop checking if a
> message was received or not, causing wal_receiver_timeout to be ignored
> in basically any logical replication deployments. This also broke the
> ping sent to the server when reaching half of wal_receiver_timeout.
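
The effect described in that commit message can be illustrated with a small simulation. This is only a simplified model in Python, not the actual PostgreSQL code: the function name `detects_timeout`, its parameters, and the use of logical time are all assumptions made for the sketch. It shows why re-initializing the "last received" timestamp on every loop iteration keeps the elapsed time from ever reaching the timeout.

```python
def detects_timeout(arrivals, timeout, tick, reset_every_loop):
    """Simulate a receive loop over logical time.

    arrivals: one bool per loop iteration (True = a message was received).
    timeout:  idle time after which the loop should give up.
    tick:     time spent waiting for data in each iteration.
    reset_every_loop: True models the bug, False models the fixed behavior.
    Returns True if the loop would report a timeout.
    """
    now = 0.0
    last_recv = now  # initialized once, before the loop
    for got_message in arrivals:
        if reset_every_loop:
            # The bug: the timestamp is re-initialized on every iteration,
            # so the measured idle time never exceeds one tick.
            last_recv = now
        now += tick
        if got_message:
            last_recv = now
        elif now - last_recv >= timeout:
            return True
    return False
```

With ten silent iterations, `detects_timeout([False] * 10, timeout=5, tick=1, reset_every_loop=False)` reports a timeout, while the same call with `reset_every_loop=True` never does, as long as `tick` is smaller than `timeout`.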

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
