Re: Postgresql 11: terminating walsender process due to replication timeout

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: abhishek(dot)bhola(at)japannext(dot)co(dot)jp
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Postgresql 11: terminating walsender process due to replication timeout
Date: 2021-09-13 01:10:41
Message-ID: 20210913.101041.2246839776609019379.horikyota.ntt@gmail.com
Lists: pgsql-general

At Fri, 10 Sep 2021 16:55:48 +0900, Abhishek Bhola <abhishek(dot)bhola(at)japannext(dot)co(dot)jp> wrote in
> So is there any solution to this issue?
> I did try to increase the wal_sender_timeout and it broke the pub/sub.
> I increased the wal_receiver_timeout and it wouldn't attempt to restart the
> subscription until that time elapsed.
> Due to that, the WAL segments got removed by the time it came up again and
> it stopped working.
> So given that the publisher is publishing at a higher rate than the
> subscriber is subscribing, what can be done?

Given that my assumption is right, for the subscriber to be able to
send a response, it needs to see a keepalive packet from the publisher
(sent at intervals of wal_sender_timeout/2) within every
wal_sender_timeout interval. Otherwise, it needs a "rest", that is, a
gap in the data stream from the publisher, at intervals shorter than
wal_sender_timeout.
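
(For reference, a quick way to see how far the subscriber is lagging
and what timeout applies is to look at pg_stat_replication on the
publisher. The query below is only an illustration; the lag columns
are available in PostgreSQL 10 and later.)

  -- On the publisher: the timeout after which the walsender gives up.
  SHOW wal_sender_timeout;

  -- A replay_lag that keeps growing while sent_lsn advances suggests
  -- the subscriber applies changes slower than the publisher sends.
  SELECT application_name, state,
         sent_lsn, replay_lsn,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;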

The reason the subscriber is kept busy is that it receives the next
chunk of data before it finishes applying the previous one. The
possible workarounds I came up with for now are:

- Increase the processing power of the subscriber, so that it becomes
more likely to finish applying changes before the next data block
arrives from the publisher, or at least so that it can keep catching
up with the publisher. This is the most appropriate solution, I think.

- Throttle the network bandwidth between publisher and subscriber to
obtain the same effect as the first item above. (This may have the
side effect that the bandwidth eventually becomes insufficient.)

- Break large transactions on the publisher into smaller pieces. The
publisher sends the data of a transaction all at once at transaction
commit, so this could even out the data transfer rate. (A rough sketch
of such batching follows after this list.)

- If you have set logical_decoding_work_mem *high enough* to hold such
problematic large transactions, *decreasing* it might mitigate the
issue. A lower logical_decoding_work_mem makes transaction data spill
to disk, and the spilled data may be sent at a slower rate than
in-memory data. Of course this is in exchange for total performance.
(See the second example after this list.)

- The streaming mode of logical replication introduced in PostgreSQL 14
might be able to mitigate the problem. It starts sending transaction
data before the transaction commits. (See the last example after this
list.)

I'm not sure this will be "fixed" for 13 or earlier, because a
straightforward resolution would surely decrease the maximum
processing rate at the subscriber.

> On Fri, Sep 10, 2021 at 9:26 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
> wrote:
>
> > At Thu, 9 Sep 2021 16:06:25 +0900, Abhishek Bhola <
> > abhishek(dot)bhola(at)japannext(dot)co(dot)jp> wrote in
> > > sourcedb:~$ postgres --version
> > > postgres (PostgreSQL) 11.6
> > >
> > > Sorry for missing this information.
> > > But looks like this fix is already included in the version I am running.
> >
> > Ok. I'm not sure, but there may be a case where a too-busy (or too
> > poor relative to the publisher) subscriber cannot send a response
> > for a long time. Usually the keepalive packets sent from the
> > publisher cause the subscriber to respond even while it is busy, but
> > it seems that if the subscriber applies changes more than two times
> > slower than the publisher sends them, the subscriber doesn't send a
> > response within the timeout window.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
