Re: Local replication "slot does not exist" after initial sync

From: Justin <zzzzz(dot)graf(at)gmail(dot)com>
To: Mike Lissner <mlissner(at)michaeljaylissner(dot)com>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Local replication "slot does not exist" after initial sync
Date: 2024-02-25 19:40:52
Message-ID: CALL-XePzMZt3G-5A4kJivq9tvEjVsZT_0VSC1bB_U81yvThdiQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sun, Feb 25, 2024 at 1:11 PM Mike Lissner <mlissner(at)michaeljaylissner(dot)com>
wrote:

> Sorry, two more little things here. The publisher logs add much, but
> here's what we see:
>
> STATEMENT: START_REPLICATION SLOT
> "pg_20031_sync_17418_7324846428853951375" LOGICAL F1D0/346C6508
> (proto_version '2', publication_names '"compass_publication2"')
> ERROR: replication slot "pg_20031_sync_17402_7324846428853951375" does not
> exist
>
> And I thought that maybe there'd be some magic in the REFRESH command on
> the subscriber, so I tried that:
>
> alter subscription xyz refresh publication;
>
>
> To nobody's surprise, that didn't help. :)
>
>
> On Sun, Feb 25, 2024 at 10:00 AM Mike Lissner <
> mlissner(at)michaeljaylissner(dot)com> wrote:
>
>> Hi, I set up logical replication a few days ago, but it's throwing some
>> weird log lines that have me worried. Does anybody have experience with
>> lines like the following on a subscriber:
>>
>> LOG: logical replication table synchronization worker for subscription
>> "compass_subscription", table "search_opinionscitedbyrecapdocument" has
>> started
>> ERROR: could not start WAL streaming: ERROR: replication slot
>> "pg_20031_sync_17418_7324846428853951375" does not exist
>> LOG: background worker "logical replication worker" (PID 1014) exited
>> with exit code 1
>>
>> Slots with this kind of name (pg_xyz_sync_*) are created during the
>> initial sync, but it seems like the subscription is working based on a
>> quick look in a few tables.
>>
>> I thought this might be related to running out of slots on the publisher,
>> so I increased both max_replication_slots and max_wal_senders to 50 and
>> rebooted so those would take effect. No luck.
>>
>> I thought rebooting the subscriber might help. No luck.
>>
>> When I look in the publisher to see the slots we have...
>>
>> SELECT * FROM pg_replication_slots;
>>
>> ...I do not see the one that's missing according to the log lines.
>>
>> So it seems like the initial sync might have worked properly (tables have
>> content), but that I have an errant process on the subscriber that might be
>> stuck in a retry loop.
>>
>> I haven't been able to fix this, and I think my last attempt might be a
>> new subscription with copy_data=false, but I'd rather avoid that if I can.
>>
>> Is there a way to fix or understand this so that I don't get the log
>> lines forever and so that I can be confident the replication is in good
>> shape?
>>
>> Thank you!
>>
>>
>> Mike
>>
>
Hi Mike
This happens when there is an error with another sync worker, PG has
issues when this happens, the Logical Replication Main Worker attempts to
sync another table, creates a slots, then tries to sync that table, then
errors out as it is out of sync workers, then attempts to sync another
table. the Subscriber is then stuck in a loop using up all the slots on
the publisher,. The subscriber continues to cycle through all the tables
that are not status r in the pg_subscription_rel table. Once there no
slots available on the publisher the subscriber attempts to create slots,
fails and returns the above errors.

These are not critical errors however it is difficult to clean up as
every time starting and stopping the subscriber it immediately retries to
initial sync the tables it has slots for. So it keeps using up all the
publisher slots.

You need to look through the logs looking for the initial error that kicked
this off, it can be a duplicate key or the subscriber or publisher
disconnected due to wal_reciever_timeout or wal_sender_timerout.

I typical disable all the timeouts as it does not take much to cause the
subscriber and publisher to think the connection has timed out.

A tell tale sign of this is the pg_subscription_rel status will be d for a
bunch of tables. There should only be few tables that are status d,

I only attempt to clean this up when working on big databases and a lot of
tables have already been sync.

Need to figure out the initial cause of the problem, correct that error and
attempt an initial sync again.

.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message jian he 2024-02-26 04:04:24 Re: Documentation diff
Previous Message Mike Lissner 2024-02-25 18:11:23 Re: Local replication "slot does not exist" after initial sync