Logical Replication disconnecting and taking a while to return back.

From: Lucio Chiessi <lucio(dot)chiessi(at)trustly(dot)com>
To: PGSql Admin list <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Logical Replication disconnecting and taking a while to return back.
Date: 2024-07-08 17:17:20
Message-ID: CADoTbHXOifF33zdn2Lg_Qu6Q7=3b=VeQdoZF6QCg6mG6waaB2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi! I hope you are well.
I'm trying to understand a (strange, I think) behavior here when we using
logical replication in PostgreSQL 14.11. This only happens when
using the *streaming
= on* parameter on the subscription. If the value is *off*, the problem
doesn't happen.
This happens in the AWS RDS PostgreSQL 14.11, but I reproduced the same
behavior in my local environment using MacOS and Docker (Rancher).
Below, are the steps that are used to reproduce this behavior:

- Create a logical replication using a publication/subscription pair
between two different PostgreSQL 14.11 instances on a different host, using
the option *streaming = on* in the subscription;
- Do a large update in a long transaction on a table or tables
controlled by this subscription. You must do a huge update, that will
cause an autovacuum, with an index cleanup phase, after the transaction
commit;
- In this case, the replication slot will go down and take a while to
return back. I will show some pieces of logs below;
- This appears on the *publicator's log file* when replication slot
stops, by a walsender timeout:

*2024-06-27 10:47:19 UTC:10.65.xx.131(24376):replicator(at)DB20:[17466]:LOG:
terminating walsender process due to replication timeout 2024-06-27
10:47:19 UTC:10.65.xx.131(24376):replicator(at)DB20:[17466]:STATEMENT:
START_REPLICATION SLOT "subs_rept_opr1_02" LOGICAL 1AA6/3D6BC570
(proto_version '2', streaming 'on', publication_names
'"publi_rept_opr1_02"') 2024-06-27 10:47:19
UTC:10.65.xx.131(24376):replicator(at)DB20:[17466]:LOG: disconnection:
session time: 1650:28:29.615 user=replicator database=DB20
host=10.65.xx.131 port=24376 *
- At this time, we can't see any messages on the subscriber log
files. The messages only appear on subscriber log files when
replication
comes back online again, like the subscriber piece of log below:

*2024-06-27 12:26:01 UTC::@:[25739]:ERROR: could not receive data from WAL
stream: SSL connection has been closed unexpectedly 2024-06-27 12:26:01
UTC::@:[396]:LOG: background worker "logical replication worker" (PID
25739) exited with exit code 1 2024-06-27 12:26:01 UTC::@:[7325]:LOG:
logical replication apply worker for subscription
"subs_rept_opr1_02" has
started 2024-06-27 12:39:49 UTC::@:[7325]:ERROR: could not receive data
from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before
or while processing the request. server closed the connection
unexpectedly This probably means the server terminated
abnormally before or while processing the
request.
server closed the connection unexpectedly This probably
means the server terminated abnormally before or while
processing the request. 2024-06-27 12:39:49 UTC::@:[396]:LOG: background
worker "logical replication worker" (PID 7325) exited with exit code 1
2024-06-27 12:39:49 UTC::@:[8881]:LOG: logical replication apply worker
for subscription "subs_rept_opr1_02" has started*
- Remember this only happens if the option *streaming = on* in the
related subscription. I tested the same behavior using *streaming = off*
(default) and the problem didn't happen.
- I'm not sure, and I don't know if this can be related to the
autovacuum process running after the long update commit. The picture below
shows the autovacuum phases (on our monitoring tool) in the updated table.
[image: image.png]
- Logical replication remains stopped from 2024-06-27 10:47:19 UTC
to 2024-06-27 12:26:01 UTC, without any new messages in the log files;
- At *12:39:49 UTC we had problems again, but the logical replication
recovery was shorter;*
- No replication data was lost in the process.

I would appreciate if you could help or advise me on understanding what
happened here.
All my best!!

*Lucio Chiessi*

*Staff Database Administrator *

*Trustly, Inc.*

M: +55 27 996360276

<https://www.linkedin.com/company/trustly/>
<https://www.facebook.com/trustly> <https://twitter.com/Trustly>

<https://www.linkedin.com/company/trustly/>
<https://www.facebook.com/trustly> <https://twitter.com/Trustly>

<https://www.trustly.net/>

--
Please read our privacy policy here
<https://www.trustly.net/about-us/privacy-policy> on how we process your
personal data in accordance with the General Data Protection Regulation
(EU) 2016/679 (the “GDPR”) and other applicable data protection legislation.

Browse pgsql-admin by date

  From Date Subject
Next Message niraj nandane 2024-07-09 03:16:06 Re: How to restrict schema size per tenant
Previous Message Peter Adlersburg 2024-07-08 16:40:41 Re: Retrieve "CREATE FOREIGN SERVER" with pg_dump ... --schema=