Re: BUG #18433: Logical replication timeout

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Костянтин Томах <tomahkvt(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18433: Logical replication timeout
Date: 2024-04-19 11:51:08
Message-ID: CANhcyEVs-F7G+bf3j2sym-GHi1hJ0Eo+LmoRV3ONzbjEFW0W4Q@mail.gmail.com
Lists: pgsql-bugs

Hi,

> Hello, Shlok Kyal. Thank you very much for your answer.
> Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout from 30 seconds up to 300 seconds on both the source and destination DB, but we got the same error.
> We set REPLICA IDENTITY FULL on PostgreSQL 13 because we got an error at the destination DB: "ERROR: logical replication target relation "public.table" has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation does not have REPLICA IDENTITY FULL". We didn't have such errors on PostgreSQL 10.
>
>
> On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
>>
>> Hi,
>>
>> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
>> <noreply(at)postgresql(dot)org> wrote:
>> >
>> > The following bug has been logged on the website:
>> >
>> > Bug reference: 18433
>> > Logged by: Kostiantyn
>> > Email address: tomahkvt(at)gmail(dot)com
>> > PostgreSQL version: 13.14
>> > Operating system: AWS RDS
>> > Description:
>> >
>> > On PostgreSQL 10 we used the following approach for the big tables:
>> > 1) Download the schema from the source database instance.
>> > 2) Delete PK, FK, and indexes for tables bigger than 100 GB from the
>> > schema.
>> > 3) Upload the schema to the destination DB.
>> > 4) Configure logical replication between the source and destination DB.
>> > 5) Wait until, for the last table, the log reports: logical replication
>> > table synchronization worker for subscription
>> > "db_name_public_subscription", table "table_name" has finished.
>> > 6) Create all the necessary PK, FK, and indexes.
>> > This approach allowed us to upload the data more quickly, and it worked
>> > great on PostgreSQL 10.
>> >
>> > We tried the same approach for PostgreSQL 13, but we got an error.
>> > 1) Download the schema from the source database instance.
>> > 2) Delete PK, FK, and indexes for tables bigger than 100 GB from the
>> > schema.
>> > 3) Upload the schema to the destination DB.
>> > 4) Configure REPLICA IDENTITY FULL at the source DB for tables bigger
>> > than 100 GB.
>> > 5) Configure logical replication between the source and destination DB.
>> > 6) During the catchup process on this big table we got the following
>> > messages:
>> > Source DB
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT: slot
>> > "db_name_public_subscription", output plugin "pgoutput", in the begin
>> > callback, associated LSN 13705/2E913050
>> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 13702/C2C8FB30
>> > (proto_version '1', publication_names '"db_name_public_publication"')
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG: terminating
>> > walsender process due to replication timeout
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT: slot
>> > "db_name_public_subscription_18989108_sync_17127", output plugin "pgoutput",
>> > in the begin callback, associated LSN 13703/27942B70
>> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
>> > START_REPLICATION SLOT "db_name_public_subscription_18989108_sync_17127"
>> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
>> > '"db_name_public_publication"')
>> >
>> > One important point. If there is no request to source DB logical replication
>> > works fine for big tables.
>> > I saw the messages in PostgreSQL bugs like
>> > https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
>> > But I also did some tests and increased wal_keep_size
>> > and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h but
>> > without success. The setup works in PG 13 only with a small amount of data.
>>
>> I went through the issue and I think that the given logs are appearing
>> due to some delay in the Apply Worker process.
>> I could reproduce it in my development environment by applying delays
>> in Apply Worker.
>>
>> I think this issue can be resolved by increasing 'wal_sender_timeout';
>> can you try setting it to a greater value?
>>
>> Also, I noticed that while using PostgreSQL 13 you are configuring the
>> table in the source DB as REPLICA IDENTITY FULL, but you did not do the
>> same on PostgreSQL 10. Is there any specific reason for that?
>> I point it out because REPLICA IDENTITY FULL behaves differently: it
>> uses the entire row as the key.
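
For reference, the REPLICA IDENTITY setting discussed above can be applied and checked roughly as follows (a sketch; the table name is a placeholder):

```sql
-- On the source DB: publish full old-row images for a table that has
-- no primary key or REPLICA IDENTITY index (placeholder table name).
ALTER TABLE public.big_table REPLICA IDENTITY FULL;

-- Verify the setting: relreplident is 'f' for FULL, 'd' for default
-- (primary key), 'i' for index, 'n' for nothing.
SELECT relname, relreplident
FROM pg_class
WHERE relname = 'big_table';
```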

A few suggestions:
1. Can you try increasing wal_sender_timeout and wal_receiver_timeout
to more than 1 hour? In [1] you mentioned that you got the same issue
with wal_sender_timeout set to 1 hour.
2. Do you see any other unusual logs on the source DB, or any error
logs on the destination DB? If so, please share them.
3. Is it possible to share a reproducible test case for the issue?
4. I observed that if we take a lock on a table in the destination DB
which is subscribed to a publication in the source DB, we can end up
with the same logs as you mentioned in [1]. Can you check if anything
similar is happening?
5. Can you clarify the point 'One important point. If there is no
request to source DB logical replication works fine for big tables.'?
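
The timeout and lock checks suggested above can be run roughly like this (a sketch; the values are examples, and on AWS RDS the GUCs are changed via the parameter group rather than ALTER SYSTEM):

```sql
-- Raise the timeouts (requires a config reload to take effect).
ALTER SYSTEM SET wal_sender_timeout = '2h';     -- on the source DB
ALTER SYSTEM SET wal_receiver_timeout = '2h';   -- on the destination DB
SELECT pg_reload_conf();

-- On the destination DB: look for heavyweight locks that could stall
-- the apply worker, i.e. ungranted locks or AccessExclusiveLock holders.
SELECT l.pid, l.relation::regclass AS relation, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted OR l.mode = 'AccessExclusiveLock';
```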

[1]: https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org

Thanks and Regards
Shlok Kyal
