From: | Dan shmidt <dshmidt(at)hotmail(dot)com> |
---|---|
To: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Logical replication stuck in catchup state |
Date: | 2020-06-10 06:15:48 |
Message-ID: | MN2PR02MB6447F56F5D85F758D8DD5436A4830@MN2PR02MB6447.namprd02.prod.outlook.com |
Lists: | pgsql-general |
Thank you very much for your replies.
Regarding the server logs, I didn't find anything unusual: only a healthy log entry at server start, which says that it is going to recover from the same point in the WAL that was last sent.
Regarding the bug fixes, I will try to upgrade ASAP, but wouldn't a restart of the server release the lock? Is there a way to release the lock manually?
Any other suggestions on how to recover from this state without upgrading?
Is there a way to restart the replication from scratch?
________________________________
From: Dan shmidt
Sent: Wednesday, June 10, 2020 12:30 AM
To: pgsql-general(at)postgresql(dot)org <pgsql-general(at)postgresql(dot)org>
Subject: Logical replication stuck in catchup state
Hi All,
We have a setup in which there are several master nodes replicating to a single slave/backup node. We are using Postgres 11.4.
Recently, one of the nodes seems to have gotten stuck and stopped replicating.
I did some basic troubleshooting and couldn't find the root cause for that.
On one hand:
- the replication slot does seem to be active according to pg_replication_slots (sorry, no screenshot)
- on the slave node, last_msg_receipt_time in pg_stat_subscription keeps updating
On the other hand:
- on the slave node, received_lsn in pg_stat_subscription keeps pointing to the same WAL segment
- the difference redo_lsn - restart_lsn shows a lag of ~20 GB
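
For readers who want to reproduce a lag figure like the one above outside the database, here is a minimal sketch of the arithmetic. It relies on the textual LSN format (two hexadecimal numbers separated by a slash, the high and low 32 bits of a byte position), which is what the server-side function pg_wal_lsn_diff() also subtracts. The helper names and the two sample LSN values are hypothetical and are not taken from the affected system.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_lag_bytes(newer: str, older: str) -> int:
    """Return how many bytes of WAL lie between two LSNs (newer minus older)."""
    return lsn_to_bytes(newer) - lsn_to_bytes(older)


# Hypothetical sample values, chosen only to illustrate a lag of about 20 GB.
redo_lsn = "2A/10000000"
restart_lsn = "25/10000000"

lag = lsn_lag_bytes(redo_lsn, restart_lsn)
print(f"lag: {lag} bytes ({lag / 1024**3:.1f} GiB)")
```

A lag that stays constant or keeps growing between two samples taken a few minutes apart would be consistent with the slot not advancing, whereas a shrinking value would suggest that the subscriber is slowly catching up.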
According to the logs on the master, the sender seems to hit a timeout. Increasing wal_sender_timeout, even setting it to 0 (no timeout), has no effect. Yet last_msg_receipt_time keeps being updated. How is that possible?
Screenshots attached. The stuck subscription/replication slot is the one ending with "53db6". In images with more than one row, it is the second one.
Any suggestions on what may be the root cause or how to continue debugging?
Appreciate your help.
Thank you,
Dan.