Re: Runaway Initial Table Syncs in Logical Replication?

From: Don Seiler <don(at)seiler(dot)us>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Runaway Initial Table Syncs in Logical Replication?
Date: 2023-08-04 14:28:48
Message-ID: CAHJZqBAkg27MFv9xbMTuJvcL25EhS2mzxynDr7bLFPcdqZZ=ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Aug 3, 2023 at 9:12 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
wrote:

>
> Just to confirm, which subversion of PG15 are you using on the subscriber
> side ?
> Is it 15.3 ? I am confirming because I recall we fixed similar bug before
> in commit [1].
>

Yes. Specifically on the target/subscriber:

# select version();
version
-----------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 15.3 (Ubuntu 15.3-1.pgdg22.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
(1 row)

Installed on Ubuntu 22.04 via the PGDG apt repo.

The source/publisher side is:

# select version();
version
-----------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.15 (Ubuntu 12.15-1.pgdg18.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
(1 row)

Are there any ERRORs(including erros other than the slots number) in the
> log of publisher and subscriber.
> It would be helpful to see the log in case there are some other ERRORs
> that could cause problem.
>

On the subscriber I see a series of errors similar to this:

2023-08-03 18:22:50.196 UTC [239889] LOG: logical replication table
synchronization worker for subscription "sub01", table "users" has started
2023-08-03 18:22:50.209 UTC [239889] ERROR: could not create replication
slot "pg_19742_sync_18602_7263122209699118815": ERROR: all replication
slots are in use
HINT: Free one or increase max_replication_slots.

on the publisher I see this corresponding message:

2023-08-03 18:22:50.199 UTC [13884] [unknown](at)[unknown] - [unknown] LOG:
connection received: host=10.228.88.74 port=47338
2023-08-03 18:22:50.207 UTC [13884] migrator(at)foo - [unknown] LOG:
replication connection authorized: user=migrator
application_name=pg_19742_sync_18602_7263122209699118815 SSL enabled
(protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-08-03 18:22:50.209 UTC [13884] migrator(at)foo -
pg_19742_sync_18602_7263122209699118815 ERROR: all replication slots are
in use
2023-08-03 18:22:50.209 UTC [13884] migrator(at)foo -
pg_19742_sync_18602_7263122209699118815 HINT: Free one or increase
max_replication_slots.
2023-08-03 18:22:50.209 UTC [13884] migrator(at)foo -
pg_19742_sync_18602_7263122209699118815 STATEMENT: CREATE_REPLICATION_SLOT
"pg_19742_sync_18602_7263122209699118815" LOGICAL pgoutput USE_SNAPSHOT
2023-08-03 18:22:50.209 UTC [13884] migrator(at)foo -
pg_19742_sync_18602_7263122209699118815 LOG: disconnection: session time:
0:00:00.009 user=migrator database=foo host=10.228.88.74 port=47338

and later they include this message about trying to drop them first on the
subscriber:

2023-08-03 18:47:39.485 UTC [263502] LOG: logical replication table
synchronization worker for subscription "sub01", table "foo_accounts" has
started
2023-08-03 18:47:39.496 UTC [263502] LOG: could not drop replication slot
"pg_19742_sync_17238_7263122209699118815" on publisher: ERROR: replication
slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:39.497 UTC [263502] ERROR: could not create replication
slot "pg_19742_sync_17238_7263122209699118815": ERROR: all replication
slots are in use
HINT: Free one or increase max_replication_slots.

and on the publisher side for that:

2023-08-03 18:47:39.487 UTC [3232] [unknown](at)[unknown] - [unknown] LOG:
connection received: host=10.228.88.74 port=17390
2023-08-03 18:47:39.494 UTC [3232] migrator(at)foo - [unknown] LOG:
replication connection authorized: user=migrator
application_name=pg_19742_sync_17238_7263122209699118815 SSL enabled
(protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2023-08-03 18:47:39.495 UTC [3232] migrator(at)foo -
pg_19742_sync_17238_7263122209699118815 ERROR: replication slot
"pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:39.495 UTC [3232] migrator(at)foo -
pg_19742_sync_17238_7263122209699118815 STATEMENT: DROP_REPLICATION_SLOT
pg_19742_sync_17238_7263122209699118815 WAIT
2023-08-03 18:47:39.496 UTC [3232] migrator(at)foo -
pg_19742_sync_17238_7263122209699118815 ERROR: all replication slots are
in use
2023-08-03 18:47:39.496 UTC [3232] migrator(at)foo -
pg_19742_sync_17238_7263122209699118815 HINT: Free one or increase
max_replication_slots.
2023-08-03 18:47:39.496 UTC [3232] migrator(at)foo -
pg_19742_sync_17238_7263122209699118815 STATEMENT: CREATE_REPLICATION_SLOT
"pg_19742_sync_17238_7263122209699118815" LOGICAL pgoutput USE_SNAPSHOT
2023-08-03 18:47:39.496 UTC [3232] migrator(at)foo -
pg_19742_sync_17238_7263122209699118815 LOG: disconnection: session time:
0:00:00.009 user=migrator datab
ase=foo host=10.228.88.74 port=17390

and then this from a different pid, maybe some kind of cleanup process? It
was doing this for what seemed like all of the sync slots:

2023-08-03 18:47:47.975 UTC [3303] migrator(at)foo - sub01 ERROR: replication
slot "pg_19742_sync_17238_7263122209699118815" does not exist
2023-08-03 18:47:47.975 UTC [3303] migrator(at)foo - sub01 STATEMENT:
DROP_REPLICATION_SLOT pg_19742_sync_17238_7263122209699118815 WAIT

--
Don Seiler
www.seiler.us

In response to

Browse pgsql-general by date

  From Date Subject
Next Message jacktby jacktby 2023-08-05 13:46:33 How does pg parse 'select '(1,2)''
Previous Message Karsten Hilbert 2023-08-04 12:06:18 Re: question on auto_explain