Re: Subscription stuck at initialize state

From: Abhishek Bhola <abhishek(dot)bhola(at)japannext(dot)co(dot)jp>
To: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
Cc: Steve Baldwin <steve(dot)baldwin(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Subscription stuck at initialize state
Date: 2022-02-03 23:02:31
Message-ID: CAEDsCzgLHLVQXbrorugGek0iVA17W=YNz4H_uDV7jb2GBWV4_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Vijaykumar

Thank you so much for taking out so much of your time to recreate the bug.
I checked the max_logical_replication_workers on both the nodes and they
are set at 4 at the moment.
The reason why it is failing is that there are 3 existing replications on
the target node already and when I create this 4th one, it is not able to
create that temporary replication worker, needed to copy the initial data,
and therefore it never moves ahead of that state.
Since these are production databases, I can't restart them during the
weekdays, so I will try to change them during the weekend.
But I did find the warning message in the target DB logs.

"WARNING,53400,"out of logical replication worker slots",,"You might need
to increase max_logical_replication_workers.",,,,,,,"","logical replication
worker"
So I am sure this will work.

THANK YOU SO MUCH.

On Fri, Feb 4, 2022 at 3:04 AM Vijaykumar Jain <
vijaykumarjain(dot)github(at)gmail(dot)com> wrote:

> trimming the email, to avoid noise.
>
> I spent a lot of time trying multiple options/combinations and finally
> managed to replicate your debug output.
> this is when i have ,
>
> postgres=# show max_logical_replication_workers;
> * max_logical_replication_workers*
> ---------------------------------
> * 2*
> (1 row)
>
>
> on publisher
> postgres(at)controller:~$ psql -p 5001
> psql (14.1 (Ubuntu 14.1-2.pgdg20.04+1))
> Type "help" for help.
>
> postgres=# create table t_bytea(id int primary key, somebytea bytea);
> CREATE TABLE
> postgres=# create table t1(id int);
> CREATE TABLE
> postgres=# create table t2(id int);
> CREATE TABLE
> postgres=# create table t3(id int);
> CREATE TABLE
> postgres=# create publication mypub1 for table t1;
> CREATE PUBLICATION
> postgres=# create publication mypub2 for table t2;
> CREATE PUBLICATION
> postgres=# create publication mypub3 for table t3;
> CREATE PUBLICATION
> postgres=# create publication mypub4 for table t3;
> CREATE PUBLICATION
> postgres=# create publication mypub5 for table t_bytea;
> CREATE PUBLICATION
> postgres=# insert into t_bytea select x,repeat(repeat('xxx', 100),
> 1000)::bytea from generate_series(1, 1000) x;
> INSERT 0 1000
>
>
>
> on subscriber
> postgres(at)controller:~$ psql -p 5002
> psql (14.1 (Ubuntu 14.1-2.pgdg20.04+1))
> Type "help" for help.
>
> postgres=# create table t_bytea(id int primary key, somebytea bytea);
> CREATE TABLE
> postgres=# create table t1(id int);
> CREATE TABLE
> postgres=# create table t2(id int);
> CREATE TABLE
> postgres=# create table t3(id int);
> CREATE TABLE
> postgres=# create table t4(id int);
> CREATE TABLE
> postgres=# create table t5(id int);
> CREATE TABLE
> postgres=# create subscription mysub1 connection 'port=5001' publication
> mypub1;
> NOTICE: created replication slot "mysub1" on publisher
> CREATE SUBSCRIPTION
> postgres=# create subscription mysub2 connection 'port=5001' publication
> mypub2;
> NOTICE: created replication slot "mysub2" on publisher
> CREATE SUBSCRIPTION
> postgres=# create subscription mysub3 connection 'port=5001' publication
> mypub3;
> NOTICE: created replication slot "mysub3" on publisher
> CREATE SUBSCRIPTION
> postgres=# create subscription mysub4 connection 'port=5001' publication
> mypub4;
> NOTICE: created replication slot "mysub4" on publisher
> CREATE SUBSCRIPTION
> postgres=# create subscription mysub5 connection 'port=5001' publication
> mypub5;
> NOTICE: created replication slot "mysub5" on publisher
> CREATE SUBSCRIPTION
> *postgres=# select count(1) from t_bytea;*
> * count*
> *-------*
> * 0*
> *(1 row)*
>
> postgres=# table pg_subscription_rel;
> srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
> 16406 | 16391 | r | 0/1722838
> * 16407 | 16394 | i |*
> * 16408 | 16397 | i |*
> * 16409 | 16397 | i |*
> * 16410 | 16384 | i |*
> (5 rows)
>
> # as expected no data in t_bytea (as it could not get any worker)
> postgres=# select count(1) from t_bytea;
> count
> -------
> 0
> (1 row)
>
>
>
> but logs clearly state the problem
> 2022-02-03 23:18:31.107 IST [3430] LOG: logical replication table
> synchronization worker for subscription "mysub1", table "t1" has started
> 2022-02-03 23:18:31.138 IST [3430] LOG: logical replication table
> synchronization worker for subscription "mysub1", table "t1" has finished
> 2022-02-03 23:18:40.730 IST [3433] LOG: logical replication apply worker
> for subscription "mysub2" has started
> *2022-02-03 23:18:40.737 IST [3433] WARNING: out of logical replication
> worker slots*
> *2022-02-03 23:18:40.737 IST [3433] HINT: You might need to increase
> max_logical_replication_workers.*
> *2022-02-03 23:18:45.865 IST [3433] WARNING: out of logical replication
> worker slots*
>
>
> *#publisher logs*
> 2022-02-03 23:18:31.096 IST [3427] STATEMENT: CREATE_REPLICATION_SLOT
> "mysub1" LOGICAL pgoutput NOEXPORT_SNAPSHOT
> 2022-02-03 23:18:31.106 IST [3429] LOG: starting logical decoding for
> slot "mysub1"
> 2022-02-03 23:18:31.106 IST [3429] DETAIL: Streaming transactions
> committing after 0/1722800, reading WAL from 0/17227C8.
> 2022-02-03 23:18:31.106 IST [3429] STATEMENT: START_REPLICATION SLOT
> "mysub1" LOGICAL 0/0 (proto_version '2', publication_names '"mypub1"')
> 2022-02-03 23:18:31.106 IST [3429] LOG: logical decoding found consistent
> point at 0/17227C8
> 2022-02-03 23:18:31.106 IST [3429] DETAIL: There are no running
> transactions.
> 2022-02-03 23:18:31.106 IST [3429] STATEMENT: START_REPLICATION SLOT
> "mysub1" LOGICAL 0/0 (proto_version '2', publication_names '"mypub1"')
> 2022-02-03 23:18:31.129 IST [3431] LOG: logical decoding found consistent
> point at 0/1722800
> 2022-02-03 23:18:31.129 IST [3431] DETAIL: There are no running
> transactions.
> 2022-02-03 23:18:31.129 IST [3431] STATEMENT: CREATE_REPLICATION_SLOT
> "pg_16406_sync_16391_7060540926182153512" LOGICAL pgoutput USE_SNAPSHOT
> 2022-02-03 23:18:31.135 IST [3431] LOG: starting logical decoding for
> slot "pg_16406_sync_16391_7060540926182153512"
> 2022-02-03 23:18:31.135 IST [3431] DETAIL: Streaming transactions
> committing after 0/1722838, reading WAL from 0/1722800.
> 2022-02-03 23:18:31.135 IST [3431] STATEMENT: START_REPLICATION SLOT
> "pg_16406_sync_16391_7060540926182153512" LOGICAL 0/1722838 (proto_version
> '2', publication_names '"mypub1"')
> 2022-02-03 23:18:35.718 IST [3432] LOG: logical decoding found consistent
> point at 0/1722838
> 2022-02-03 23:18:35.718 IST [3432] DETAIL: There are no running
> transactions.
> 2022-02-03 23:18:35.718 IST [3432] STATEMENT: CREATE_REPLICATION_SLOT
> "mysub2" LOGICAL pgoutput NOEXPORT_SNAPSHOT
> 2022-02-03 23:18:40.737 IST [3434] LOG: starting logical decoding for
> slot "mysub2"
> 2022-02-03 23:18:40.737 IST [3434] DETAIL: Streaming transactions
> committing after 0/1722870, reading WAL from 0/1722838.
> 2022-02-03 23:18:40.737 IST [3434] STATEMENT: START_REPLICATION SLOT
> "mysub2" LOGICAL 0/0 (proto_version '2', publication_names '"mypub2"')
> 2022-02-03 23:18:40.737 IST [3434] LOG: logical decoding found consistent
> point at 0/1722838
> 2022-02-03 23:18:40.737 IST [3434] DETAIL: There are no running
> transactions.
> 2022-02-03 23:18:40.737 IST [3434] STATEMENT: START_REPLICATION SLOT
> "mysub2" LOGICAL 0/0 (proto_version '2', publication_names '"mypub2"')
> 2022-02-03 23:18:40.857 IST [3435] LOG: logical decoding found consistent
> point at 0/1722870
> 2022-02-03 23:18:40.857 IST [3435] DETAIL: There are no running
> transactions.
> 2022-02-03 23:18:40.857 IST [3435] STATEMENT: CREATE_REPLICATION_SLOT
> "mysub3" LOGICAL pgoutput NOEXPORT_SNAPSHOT
> 2022-02-03 23:18:46.091 IST [3437] LOG: logical decoding found consistent
> point at 0/17228A8
> 2022-02-03 23:18:46.091 IST [3437] DETAIL: There are no running
> transactions.
> 2022-02-03 23:18:46.091 IST [3437] STATEMENT: CREATE_REPLICATION_SLOT
> "mysub4" LOGICAL pgoutput NOEXPORT_SNAPSHOT
> 2022-02-03 23:18:51.941 IST [3439] LOG: logical decoding found consistent
> point at 0/17228E0
> 2022-02-03 23:18:51.941 IST [3439] DETAIL: There are no running
> transactions.
> 2022-02-03 23:18:51.941 IST [3439] STATEMENT: CREATE_REPLICATION_SLOT
> "mysub5" LOGICAL pgoutput NOEXPORT_SNAPSHOT
>
>
>
> *# solution*
> so i bump the max_logical_replication_workers = 10 and restart the db
> servers.
> as there are workers available, replication catches up and tables in sync.
>
> # on subscriber
> psql (14.1 (Ubuntu 14.1-2.pgdg20.04+1))
> Type "help" for help.
>
> postgres=# select count(1) from t_bytea;
> count
> -------
> 0
> (1 row)
>
> postgres=# table pg_subscription_rel;
> srsubid | srrelid | srsubstate | srsublsn
> ---------+---------+------------+-----------
> 16406 | 16391 | r | 0/1722838
> 16407 | 16394 | r | 0/2000110
> 16408 | 16397 | r | 0/2000148
> 16409 | 16397 | r | 0/2000148
> 16410 | 16384 | r | 0/20001B8
> (5 rows)
>
> postgres=# select count(1) from t_bytea;
> count
> -------
> 1000
> (1 row)
>
>
> so two things,
> can you check logs for warning messages of out of *logical replication
> worker slots *on subscribers ?
> can you try bumping them, and check if it catches up (srsubstate should
> either be *r(READY)* or *d (INITIAL COPY))*
>
> also monitor logs for both publisher and subscriber.
>
> I also figured out all the debugging steps I requested earlier were
> useless.
>
>
>
> --
> Thanks,
> Vijay
> LinkedIn - Vijaykumar Jain <https://www.linkedin.com/in/vijaykumarjain/>
>

--
_This correspondence (including any attachments) is for the intended
recipient(s) only. It may contain confidential or privileged information or
both. No confidentiality or privilege is waived or lost by any
mis-transmission. If you receive this correspondence by mistake, please
contact the sender immediately, delete this correspondence (and all
attachments) and destroy any hard copies. You must not use, disclose, copy,
distribute or rely on any part of this correspondence (including any
attachments) if you are not the intended
recipient(s).本メッセージに記載および添付されている情報(以下、総称して「本情報」といいます。)は、本来の受信者による使用のみを意図しています。誤送信等により本情報を取得された場合でも、本情報に係る秘密、または法律上の秘匿特権が失われるものではありません。本電子メールを受取られた方が、本来の受信者ではない場合には、本情報及びそのコピーすべてを削除・破棄し、本電子メールが誤って届いた旨を発信者宛てにご通知下さいますようお願いします。本情報の閲覧、発信または本情報に基づくいかなる行為も明確に禁止されていることをご了承ください。_

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Michael Harris 2022-02-04 01:08:58 Re: Undetected Deadlock
Previous Message Michael Lewis 2022-02-03 22:30:22 Re: increasing effective_cache_size slows down join queries by a factor of 4000x