Re: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh <tomahkvt(at)gmail(dot)com>
To: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18433: Logical replication timeout
Date: 2024-04-24 06:03:32
Message-ID: CAJP09w7kUc6pMeB8jMMTvw_Yqt43XS8wSH4NxCf4B4x2=ARqeA@mail.gmail.com
Lists: pgsql-bugs

Hi, Shlok Kyal.
I want to add that when we tried to use REPLICA IDENTITY DEFAULT in
PostgreSQL 13, we got the following messages:
2024-03-29 06:53:52 UTC::@:[28650]:LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
2024-03-29 06:53:52 UTC::@:[28651]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
2024-03-29 06:54:06 UTC::@:[408]:LOG: checkpoint starting: time
2024-03-29 06:55:36 UTC::@:[28651]:ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL
2024-03-29 06:55:36 UTC::@:[396]:LOG: background worker "logical replication worker" (PID 28651) exited with exit code 1
2024-03-29 06:55:36 UTC::@:[31784]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
After this error, the table synchronization starts again from the beginning.
Because the table at the destination DB is rewritten many times, it ends up
bigger than the table at the source. This is the behavior in PostgreSQL 13.
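
A rough way to see this growth (a sketch only; run the same query on both
the source and the destination, with the table name from the logs above
used as a placeholder):

    -- Total on-disk size of the table, including indexes and TOAST.
    SELECT pg_size_pretty(pg_total_relation_size('public."table"'));
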
In PostgreSQL 10, as I remember, there was no "background worker "logical
replication worker" (PID 28651) exited with exit code 1" message. The
behavior was:
1) The destination DB reported that the "logical replication table
synchronization worker for subscription "db_name_public_subscription",
table "table_name" has finished".
2) Then the destination DB periodically reported that the destination table
did not have a PRIMARY KEY. There was no "background worker "logical
replication worker" (PID 28651) exited with exit code 1" message; the
replication process simply waited until the problem was fixed.
3) We created a primary key on the destination DB.
4) The replication process continued without any failures or restarts (the
sync state can be checked as sketched below).
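
For completeness, this is roughly how the per-table sync state can be
watched on the destination (a sketch; in PostgreSQL 13 the srsubstate codes
are i = initialize, d = data copy, s = synchronized, r = ready):

    -- Per-table synchronization state of the subscription.
    SELECT srrelid::regclass AS table_name, srsubstate
    FROM pg_subscription_rel;

    -- Running apply / tablesync workers and the LSNs they have reached.
    SELECT subname, pid, relid::regclass, received_lsn, latest_end_lsn
    FROM pg_stat_subscription;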

On Tue, 23 Apr 2024 at 12:58, Kostiantyn Tomakh <tomahkvt(at)gmail(dot)com> wrote:

>
> Hi, Shlok Kyal.
> 1. We set wal_sender_timeout=1h and wal_receiver_timeout=1h.
> 2. Here are the logs after these changes, for a table of 39 GB:
>
> DB_source logs
> 2024-04-22 14:13:25 UTC:10.10.10.10(64442):replication_role(at)db:[16497]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput NOEXPORT_SNAPSHOT
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: starting logical decoding for slot "db_name_public_subscription"
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:DETAIL: Streaming transactions committing after 139C2/AD4A6FA0, reading WAL from 139C2/AD414CC0.
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: logical decoding found initial starting point at 139C2/AD416018
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:DETAIL: Waiting for transactions (approximately 3) older than 2782438580 to end.
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: logical decoding found consistent point at 139C2/AD4A6F68
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:DETAIL: There are no running transactions.
> 2024-04-22 14:13:25 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
> 2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:DETAIL: There are no running transactions.
> 2024-04-22 14:13:25 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT: CREATE_REPLICATION_SLOT "db_name_public_subscription_18989252_sync_17090" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
> 2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: starting logical decoding for slot "db_name_public_subscription_18989252_sync_17090"
> 2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:DETAIL: Streaming transactions committing after 139C2/AD637D90, reading WAL from 139C2/AD6317D8.
> 2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: logical decoding found consistent point at 139C2/AD637D58
> 2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:DETAIL: There are no running transactions.
> 2024-04-22 14:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: terminating walsender process due to replication timeout
> 2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:CONTEXT: slot "db_name_public_subscription", output plugin "pgoutput", in the commit callback, associated LSN 139C6/3C286C98
> 2024-04-22 15:38:35 UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0 (proto_version '1', publication_names '"db_name_public_publication"')
> 2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: terminating walsender process due to replication timeout
> 2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:CONTEXT: slot "db_name_public_subscription_18989252_sync_17090", output plugin "pgoutput", in the begin callback, associated LSN 139C5/4C0A92B8
> 2024-04-22 15:38:35 UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT: START_REPLICATION SLOT "db_name_public_subscription_18989252_sync_17090" LOGICAL 139C2/AD637D90 (proto_version '1', publication_names '"db_name_public_publication"')
>
> DB_destination logs
>
> 2024-04-22 14:13:25 UTC::@:[847]:LOG: logical replication apply worker for subscription "db_name_public_subscription" has started
> 2024-04-22 14:13:25 UTC::@:[848]:LOG: logical replication table synchronization worker for subscription "db_name_public_subscription", table "table" has started
> 2024-04-22 15:00:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:00:58 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.752 s, sync=0.005 s, total=1.782 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65540 kB, estimate=4065870 kB
> 2024-04-22 15:05:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:05:58 UTC::@:[403]:LOG: checkpoint complete: wrote 12 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.249 s, sync=0.005 s, total=1.277 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65487 kB, estimate=3665831 kB
> 2024-04-22 15:10:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:11:00 UTC::@:[403]:LOG: checkpoint complete: wrote 24 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=2.455 s, sync=0.005 s, total=2.483 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65631 kB, estimate=3305811 kB
> 2024-04-22 15:15:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:15:59 UTC::@:[403]:LOG: checkpoint complete: wrote 16 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.652 s, sync=0.006 s, total=1.719 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65541 kB, estimate=2981784 kB
> 2024-04-22 15:20:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:20:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.006 s, total=1.783 s; sync files=4, longest=0.006 s, average=0.002 s; distance=65522 kB, estimate=2690158 kB
> 2024-04-22 15:25:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:25:58 UTC::@:[403]:LOG: checkpoint complete: wrote 6 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.645 s, sync=0.004 s, total=0.673 s; sync files=3, longest=0.004 s, average=0.002 s; distance=65456 kB, estimate=2427688 kB
> 2024-04-22 15:30:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:30:59 UTC::@:[403]:LOG: checkpoint complete: wrote 15 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.550 s, sync=0.005 s, total=1.580 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65573 kB, estimate=2191476 kB
> 2024-04-22 15:35:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:35:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.751 s, sync=0.005 s, total=1.779 s; sync files=4, longest=0.005 s, average=0.002 s; distance=65541 kB, estimate=1978883 kB
> 2024-04-22 15:40:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:40:58 UTC::@:[403]:LOG: checkpoint complete: wrote 7 buffers (0.0%); 0 WAL file(s) added, 2 removed, 0 recycled; write=0.746 s, sync=0.004 s, total=0.778 s; sync files=3, longest=0.004 s, average=0.002 s; distance=131032 kB, estimate=1794098 kB
> 2024-04-22 15:45:57 UTC::@:[403]:LOG: checkpoint starting: time
> 2024-04-22 15:45:59 UTC::@:[403]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=1.148 s, sync=0.005 s, total=1.180 s; sync files=3, longest=0.005 s, average=0.002 s; distance=65533 kB, estimate=1621241 kB
> 2024-04-22 15:48:33 UTC::@:[847]:FATAL: terminating logical replication worker due to administrator command
> 2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 847) exited with exit code 1
> 2024-04-22 15:48:33 UTC::@:[848]:FATAL: terminating logical replication worker due to administrator command
> 2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical replication worker" (PID 848) exited with exit code 1
>
> 3. This log was captured for a 39 GB table, but we have tables much bigger than this.
>
> Our workflow for the 39 GB table:
>
> 1) Download the schema from the source database instance.
>
> 2) Delete the PK, FK, and indexes from the table schema (we did this to speed up the data load / startup process).
> 3) Upload the schema to the destination DB.
> 4) Configure REPLICA IDENTITY FULL at the source DB for this table.
> 5) Configure logical replication of this table between the source and destination DB.
> 6) During the catchup phase for this table we got the messages in the log that you can see above.
>
> 7) We also tried to create a primary key for this table during the catchup state, but that process was blocked by the logical replication worker, so we had to cancel the primary key creation.
>
> I want to highlight that we used this workflow for PostgreSQL 10 and it worked.
>
> 4. Can you clarify the point 'One important point. If there is no request to source DB logical replication works fine for big tables.'?
>
> I meant that if the source database doesn't receive delete, insert, or update queries, the catchup process is very short and we don't have this problem.
>
> Thank you in advance.
>
>
>
>
> On Fri, 19 Apr 2024 at 14:51, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>
>> Hi,
>>
>> > Hello, Shlok Kyal. Thank you very much for your answer.
>> > Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout
>> from 30 seconds up to 300 seconds for source and destination DB but we got
>> the same error.
>> > We set REPLICA IDENTITY FULL on Postgresql 13 because we got an error
>> at destination DB "ERROR: logical replication target relation
>> "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and
>> published relation does not have REPLICA IDENTITY FULL". We didn't have
>> such errors on Postgresql 10.
>> >
>> >
>> > On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
>> >> <noreply(at)postgresql(dot)org> wrote:
>> >> >
>> >> > The following bug has been logged on the website:
>> >> >
>> >> > Bug reference: 18433
>> >> > Logged by: Kostiantyn
>> >> > Email address: tomahkvt(at)gmail(dot)com
>> >> > PostgreSQL version: 13.14
>> >> > Operating system: AWS RDS
>> >> > Description:
>> >> >
>> >> > On Postgresql 10 we used the following approach for the big tables:
>> >> > 1) Download schema from the source database instance
>> >> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> >> > schema
>> >> > 3)Upload the schema to the destination DB.
>> >> > 4) Configure Logical replication between source and destination DB.
>> >> > 5) When last table logical replication table synchronization worker
>> for
>> >> > subscription "db_name_public_subscription", table "table_name" has
>> >> > finished
>> >> > 6) we created all the necessary PK, FK, and Indexes.
>> >> > This approach allowed to us upload data more quickly. This approach
>> was
>> >> > working great on PostgreSQL 10.
>> >> >
>> >> > We tried the same approach for Postgresql13, but we got an error.
>> >> > 1) Download schema from the source database instance
>> >> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
>> >> > schema
>> >> > 3)Upload the schema to the destination DB.
>> >> > 4) configurated identity replication full at source DB for tables
>> bigger
>> >> > than 100Gb
>> >> > 5) Configured Logical replication between source and destination DB.
>> >> > 6) During catchup on this big table process we got the following
>> >> > messages:
>> >> > Source DB
>> >> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG:
>> terminating
>> >> > walsender process due to replication timeout
>> >> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT:
>> slot
>> >> > "db_name_public_subscription", output plugin "pgoutput", in the begin
>> >> > callback, associated LSN 13705/2E913050
>> >> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
>> >> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL
>> 13702/C2C8FB30
>> >> > (proto_version '1', publication_names '"db_name_public_publication"')
>> >> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG:
>> terminating
>> >> > walsender process due to replication timeout
>> >> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT:
>> slot
>> >> > "db_name_public_subscription_18989108_sync_17127", output plugin
>> "pgoutput",
>> >> > in the begin callback, associated LSN 13703/27942B70
>> >> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
>> >> > START_REPLICATION SLOT
>> "db_name_public_subscription_18989108_sync_17127"
>> >> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
>> >> > '"db_name_public_publication"')
>> >> >
>> >> > One important point. If there is no request to source DB logical
>> replication
>> >> > works fine for big tables.
>> >> > I saw the messages in PostgreSQL bugs like
>> >> >
>> https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
>> >> > But I also did some tests and increased wal_keep_size
>> >> > and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to
>> 1h but
>> >> > without success. The setup works in PG 13 only with a small amount
>> of data.
>> >>
>> >> I went through the issue and I think that the given logs are appearing
>> >> due to some delay in the Apply Worker process.
>> >> I could reproduce it in my development environment by applying delays
>> >> in Apply Worker.
>> >>
>> >> I think this issue can be resolved by setting 'wal_sender_timeout' to
>> >> a greater value. Can you try setting 'wal_sender_timeout' to a greater
>> >> value?
>> >>
>> >> Also, I noticed that while using Postgresql13 you are configuring the
>> >> table in sourceDB as REPLICA IDENTITY FULL but not doing the same in
>> >> Postgresql10. Is there any specific reason for it?
>> >> I pointed it out because REPLICA IDENTITY FULL has a different
>> >> behaviour and sets the entire row as key.
>>
>> Few suggestions:
>> 1. Can you try increasing wal_sender_timeout and wal_receiver_timeout
>> to more than 1hrs? As in [1] you mentioned you were getting the same
>> issue for wal_sender_timeout set to 1hr.
>> 2. Do you find any other unusual logs in SourceDB or any errors logs
>> in DestinationDB, If so please share them ?
>> 3. Is it possible to share a reproducible testcase for the issue?
>> 4. I observe that if we take a lock on table in DestinationDB which is
>> subscribed to publication in SourceDB, we may end up getting the same
>> logs as you mentioned in [1]. Can you check if anything similar is
>> happening?
>> 5. Can you clarify the point 'One important point. If there is no
>> request to source DB logical replication works fine for big tables.'?
>>
>> [1]:
>> https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org
>>
>> Thanks and Regards
>> Shlok Kyal
>>
>
