Re: Help diagnosing replication (copy) error

From: Jeff Ross <jross(at)openvistas(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Help diagnosing replication (copy) error
Date: 2024-03-09 00:06:20
Message-ID: f39e6929-c290-4f08-bcdc-fe409c740fd7@openvistas.net
Lists: pgsql-general

On 3/8/24 14:50, Steve Baldwin wrote:

> Hi,
>
> I'm in the process of migrating a cluster from 15.3 to 16.2. We have a
> 'zero downtime' requirement so I'm using logical replication to create
> the new cluster and then perform the switch in the application.
>
> I have a situation where all but one table have done their initial
> copy. The remaining table is the largest (of course), and the
> replication slot that is assigned for the copy
> (pg_378075177_sync_60067_7343845372910323059) is showing as
> 'active=false' if I select from pg_replication_slots on the publisher.
>
> I've checked the recent logs for both the publishing cluster and the
> subscribing cluster but I can't see any replication errors. I guess I
> could have missed them, but it doesn't seem like anything is being
> 'retried' like I've seen in the past with replication errors.
>
> I've used this mechanism for zero-downtime upgrades multiple times in
> the past, and have recently used it to upgrade smaller clusters from
> 15.x to 16.2 without issue.
>
> The clusters are hosted on AWS RDS, so I have no access to the
> servers, but if that's the only way to diagnose the issue, I can
> create a support case.
>
> Does anyone have any suggestions as to where I should look for the issue?
>
> Thanks,
>
> Steve

In our setup we're logically replicating a 450G database hosted on real
hardware to an RDS instance.

Multiple times we've had replication simply stop, and we could never
find any reason for that on either the publisher or the subscriber.
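
For anyone checking the same thing, here is a minimal sketch of where one
might look besides the logs. It assumes SQL access to both sides, which
RDS does allow. The two query strings read the standard catalogs
pg_subscription_rel (subscriber) and pg_replication_slots (publisher).
The helper name unfinished_tables and the sample rows are hypothetical,
and the code does not connect to anything; it only interprets rows you
have already fetched.

```python
# Per-table sync state codes as documented for pg_subscription_rel.srsubstate.
SYNC_STATES = {
    "i": "initialize",
    "d": "data is being copied",
    "f": "finished table copy",
    "s": "synchronized",
    "r": "ready (normal replication)",
}

# Run on the subscriber: one row per table in each subscription.
SUBSCRIBER_QUERY = (
    "SELECT srrelid::regclass AS table_name, srsubstate "
    "FROM pg_subscription_rel;"
)

# Run on the publisher: shows whether each slot is active and how far it got.
PUBLISHER_QUERY = (
    "SELECT slot_name, active, restart_lsn, confirmed_flush_lsn "
    "FROM pg_replication_slots;"
)


def unfinished_tables(rows):
    """Return (table, description) for every table not yet in state 'r'.

    `rows` is an iterable of (table_name, srsubstate) pairs, i.e. the
    result of SUBSCRIBER_QUERY fetched by whatever client you use.
    """
    return [
        (table, SYNC_STATES.get(state, "unknown"))
        for table, state in rows
        if state != "r"
    ]
```

A table that sits in state 'd' while its sync slot shows active = false
on the publisher would at least confirm that the copy has stalled rather
than finished, though it would not say why.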

The *only* solution that ever worked in these cases was dropping the
subscription in RDS and re-creating it with (copy_data = false).
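
A rough sketch of that drop-and-re-create step, written as a small
Python helper that only builds the two SQL statements. The subscription
name, publication name, and connection string in the usage lines are
placeholders, not values from this thread, and the quoting helpers
assume standard_conforming_strings is on (the default).

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Single-quote a SQL string literal, doubling any embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def recreate_subscription_sql(sub_name: str, publication: str, conninfo: str):
    """Build the statements to drop a subscription and re-create it
    without the initial table copy (copy_data = false)."""
    sub = quote_ident(sub_name)
    return [
        f"DROP SUBSCRIPTION {sub};",
        f"CREATE SUBSCRIPTION {sub} "
        f"CONNECTION {quote_literal(conninfo)} "
        f"PUBLICATION {quote_ident(publication)} "
        f"WITH (copy_data = false);",
    ]


# Placeholder names; run the resulting statements on the subscriber.
for stmt in recreate_subscription_sql(
    "upgrade_sub", "upgrade_pub", "host=pub.example dbname=app"
):
    print(stmt)
```

Note that DROP SUBSCRIPTION by default also drops the replication slot
on the publisher, and the new subscription then starts from a new slot.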

At that point replication picks right up again for new transactions,
*but* at the expense of losing all of the WAL that should have been
replicated during the outage. I wrote a Python-based "logical
replication fixer" to fill in those gaps.
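
The fixer itself is not shown in this message. Purely as an illustration
of one way such gap filling could work (an assumption, not a description
of the actual tool), the sketch below compares a table's rows on both
sides, keyed by primary key, and plans which keys need an insert, an
update, or a delete on the subscriber. GapPlan and plan_gap_fill are
hypothetical names, and the rows are assumed to be already fetched into
dictionaries.

```python
from typing import Dict, Hashable, List, NamedTuple, Tuple


class GapPlan(NamedTuple):
    to_insert: List[Hashable]  # keys present only on the publisher
    to_update: List[Hashable]  # keys on both sides whose rows differ
    to_delete: List[Hashable]  # keys present only on the subscriber


def plan_gap_fill(
    publisher: Dict[Hashable, Tuple],
    subscriber: Dict[Hashable, Tuple],
) -> GapPlan:
    """Compare rows keyed by primary key and plan the subscriber-side fixes.

    Both arguments map primary key -> row tuple for one table. Keys are
    assumed to be mutually comparable so the result can be sorted.
    """
    to_insert = sorted(k for k in publisher if k not in subscriber)
    to_update = sorted(
        k for k in publisher if k in subscriber and publisher[k] != subscriber[k]
    )
    to_delete = sorted(k for k in subscriber if k not in publisher)
    return GapPlan(to_insert, to_update, to_delete)
```

A full comparison like this reads every row, so on a table of any size
one would probably restrict it to rows changed during the outage, for
example by a timestamp column if the schema has one.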

Given that the subscriber is the one that initiates the connection to
the publisher, and that replication resumes as soon as the subscription
is dropped and re-created, my hunch is that this is squarely on RDS.
With both publisher and subscriber on RDS, as in your case, YMMV.

RDS is a black box--who knows what's really going on there?  It would be
interesting to see what the response is after you open a support case. 
I hope you'll be able to share that with the list.

Jeff
