Fwd: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh <tomahkvt(at)gmail(dot)com>
To: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Fwd: BUG #18433: Logical replication timeout
Date: 2024-04-23 09:58:21
Message-ID: CAJP09w6nGkXPSd0zZ1PyMiRGMOa4-2gCobb9TzJJoJry-6i0fQ@mail.gmail.com
Lists: pgsql-bugs

Hi, Shlok Kyal.
1. We set wal_sender_timeout = 1h and wal_receiver_timeout = 1h (sketched just below).
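For reference, this is roughly how the change looks on a self-managed
instance (on RDS we changed the same parameters through the DB parameter
group); the statements below are illustrative:

-- on the source (publisher)
ALTER SYSTEM SET wal_sender_timeout = '1h';
-- on the destination (subscriber)
ALTER SYSTEM SET wal_receiver_timeout = '1h';
-- both settings are reloadable, no restart needed
SELECT pg_reload_conf();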
2. Logs after these changes, for a 39 GB table:

DB_source logs
2024-04-22 14:13:25
UTC:10.10.10.10(64442):replication_role(at)db:[16497]:STATEMENT:
CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput
NOEXPORT_SNAPSHOT
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: starting
logical decoding for slot "db_name_public_subscription"
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:DETAIL: Streaming
transactions committing after 139C2/AD4A6FA0, reading WAL from
139C2/AD414CC0.
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: logical
decoding found initial starting point at 139C2/AD416018
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:DETAIL: Waiting
for transactions (approximately 3) older than 2782438580 to end.
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: logical
decoding found consistent point at 139C2/AD4A6F68
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:DETAIL: There are
no running transactions.
2024-04-22 14:13:25
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 14:13:25
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: logical
decoding found consistent point at 139C2/AD637D58
2024-04-22 14:13:25
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:DETAIL: There are
no running transactions.
2024-04-22 14:13:25
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT:
CREATE_REPLICATION_SLOT
"db_name_public_subscription_18989252_sync_17090" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
2024-04-22 14:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: starting
logical decoding for slot
"db_name_public_subscription_18989252_sync_17090"
2024-04-22 14:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:DETAIL: Streaming
transactions committing after 139C2/AD637D90, reading WAL from
139C2/AD6317D8.
2024-04-22 14:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT:
START_REPLICATION SLOT
"db_name_public_subscription_18989252_sync_17090" LOGICAL
139C2/AD637D90 (proto_version '1', publication_names
'"db_name_public_publication"')
2024-04-22 14:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: logical
decoding found consistent point at 139C2/AD637D58
2024-04-22 14:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:DETAIL: There are
no running transactions.
2024-04-22 14:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT:
START_REPLICATION SLOT
"db_name_public_subscription_18989252_sync_17090" LOGICAL
139C2/AD637D90 (proto_version '1', publication_names
'"db_name_public_publication"')
2024-04-22 15:38:35
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:LOG: terminating
walsender process due to replication timeout
2024-04-22 15:38:35
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:CONTEXT: slot
"db_name_public_subscription", output plugin "pgoutput", in the commit
callback, associated LSN 139C6/3C286C98
2024-04-22 15:38:35
UTC:10.10.10.10(64446):replication_role(at)db:[16500]:STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
2024-04-22 15:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:LOG: terminating
walsender process due to replication timeout
2024-04-22 15:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:CONTEXT: slot
"db_name_public_subscription_18989252_sync_17090", output plugin
"pgoutput", in the begin callback, associated LSN 139C5/4C0A92B8
2024-04-22 15:38:35
UTC:10.10.10.10(64462):replication_role(at)db:[16501]:STATEMENT:
START_REPLICATION SLOT
"db_name_public_subscription_18989252_sync_17090" LOGICAL
139C2/AD637D90 (proto_version '1', publication_names
'"db_name_public_publication"')

DB_destination logs

2024-04-22 14:13:25 UTC::@:[847]:LOG: logical replication apply
worker for subscription "db_name_public_subscription" has started
2024-04-22 14:13:25 UTC::@:[848]:LOG: logical replication table
synchronization worker for subscription "db_name_public_subscription",
table "table" has started
2024-04-22 15:00:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:00:58 UTC::@:[403]:LOG: checkpoint complete: wrote 17
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=1.752 s, sync=0.005 s, total=1.782 s; sync files=4,
longest=0.005 s, average=0.002 s; distance=65540 kB, estimate=4065870
kB
2024-04-22 15:05:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:05:58 UTC::@:[403]:LOG: checkpoint complete: wrote 12
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=1.249 s, sync=0.005 s, total=1.277 s; sync files=4,
longest=0.005 s, average=0.002 s; distance=65487 kB, estimate=3665831
kB
2024-04-22 15:10:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:11:00 UTC::@:[403]:LOG: checkpoint complete: wrote 24
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=2.455 s, sync=0.005 s, total=2.483 s; sync files=4,
longest=0.005 s, average=0.002 s; distance=65631 kB, estimate=3305811
kB
2024-04-22 15:15:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:15:59 UTC::@:[403]:LOG: checkpoint complete: wrote 16
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=1.652 s, sync=0.006 s, total=1.719 s; sync files=4,
longest=0.006 s, average=0.002 s; distance=65541 kB, estimate=2981784
kB
2024-04-22 15:20:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:20:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=1.751 s, sync=0.006 s, total=1.783 s; sync files=4,
longest=0.006 s, average=0.002 s; distance=65522 kB, estimate=2690158
kB
2024-04-22 15:25:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:25:58 UTC::@:[403]:LOG: checkpoint complete: wrote 6
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=0.645 s, sync=0.004 s, total=0.673 s; sync files=3,
longest=0.004 s, average=0.002 s; distance=65456 kB, estimate=2427688
kB
2024-04-22 15:30:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:30:59 UTC::@:[403]:LOG: checkpoint complete: wrote 15
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=1.550 s, sync=0.005 s, total=1.580 s; sync files=4,
longest=0.005 s, average=0.002 s; distance=65573 kB, estimate=2191476
kB
2024-04-22 15:35:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:35:59 UTC::@:[403]:LOG: checkpoint complete: wrote 17
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=1.751 s, sync=0.005 s, total=1.779 s; sync files=4,
longest=0.005 s, average=0.002 s; distance=65541 kB, estimate=1978883
kB
2024-04-22 15:40:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:40:58 UTC::@:[403]:LOG: checkpoint complete: wrote 7
buffers (0.0%); 0 WAL file(s) added, 2 removed, 0 recycled;
write=0.746 s, sync=0.004 s, total=0.778 s; sync files=3,
longest=0.004 s, average=0.002 s; distance=131032 kB, estimate=1794098
kB
2024-04-22 15:45:57 UTC::@:[403]:LOG: checkpoint starting: time
2024-04-22 15:45:59 UTC::@:[403]:LOG: checkpoint complete: wrote 11
buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled;
write=1.148 s, sync=0.005 s, total=1.180 s; sync files=3,
longest=0.005 s, average=0.002 s; distance=65533 kB, estimate=1621241
kB
2024-04-22 15:48:33 UTC::@:[847]:FATAL: terminating logical
replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical
replication worker" (PID 847) exited with exit code 1
2024-04-22 15:48:33 UTC::@:[848]:FATAL: terminating logical
replication worker due to administrator command
2024-04-22 15:48:33 UTC::@:[400]:LOG: background worker "logical
replication worker" (PID 848) exited with exit code 1

3. This log was captured for a 39 GB table, but we have tables much
bigger than that.

Our workflow for the 39 GB table (steps 4 and 5 are sketched after the list):

1) Downloaded the schema from the source database instance.
2) Deleted the PK, FK, and indexes from the table schema (we did this to
speed up the initial data load / startup phase).
3) Uploaded the schema to the destination DB.
4) Configured REPLICA IDENTITY FULL on the source DB for this table.
5) Configured logical replication of this table between the source and
destination DB.
6) During the catchup phase for this table we got the log messages you
can see above.
7) We also tried to create the primary key for this table during the
catchup state, but that was blocked by the logical replication worker,
so we had to cancel the primary key creation.
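
To make steps 4 and 5 concrete, this is roughly what we ran; the table
name and connection details are illustrative, not our real ones:

-- on the source (publisher)
ALTER TABLE public.big_table REPLICA IDENTITY FULL;
CREATE PUBLICATION db_name_public_publication FOR TABLE public.big_table;

-- on the destination (subscriber), after loading the stripped schema
CREATE SUBSCRIPTION db_name_public_subscription
    CONNECTION 'host=source-host dbname=db user=replication_role'
    PUBLICATION db_name_public_publication;
-- copy_data defaults to true, so a table sync worker performs the initial
-- copy and the table then goes through the catchup state shown in the logs.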

I want to highlight that we used this workflow on PostgreSQL 10 and it worked.

4. Can you clarify the point 'One important point. If there is no
request to source DB logical replication works fine for big tables.'?

I meant that if the source database receives no DELETE, INSERT, or
UPDATE queries, the catch-up process is very short and we don't run
into this problem.
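
For completeness, this is roughly how we watch the catch-up progress
(the queries are illustrative):

-- on the destination: per-table sync state of the subscription
-- (i = initialize, d = data copy, s = synchronized, r = ready)
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel;

-- on the source: how far each walsender lags behind the current WAL position
SELECT application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;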

Thank you in advance.

On Fri, 19 Apr 2024 at 14:51, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:

> Hi,
>
> > Hello, Shlok Kyal. Thank you very much for your answer.
> > Yes, we tried to increase wal_sender_timeout and wal_receiver_timeout
> from 30seconds up to 300 seconds for source and destination DB but we got
> the same error.
> > We set REPLICA IDENTITY FULL on Postgresql 13 because we got an error at
> destination DB "ERROR: logical replication target relation "public.table"
> has neither REPLICA IDENTITY index nor PRIMARY KEY and published relation
> does not have REPLICA IDENTITY FULL". We didn't have such errors on
> Postgresql 10.
> >
> >
> > On Wed, 17 Apr 2024 at 13:48, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> >>
> >> Hi,
> >>
> >> On Mon, 15 Apr 2024 at 13:09, PG Bug reporting form
> >> <noreply(at)postgresql(dot)org> wrote:
> >> >
> >> > The following bug has been logged on the website:
> >> >
> >> > Bug reference: 18433
> >> > Logged by: Kostiantyn
> >> > Email address: tomahkvt(at)gmail(dot)com
> >> > PostgreSQL version: 13.14
> >> > Operating system: AWS RDS
> >> > Description:
> >> >
> >> > On Postgresql 10 we used the following approach for the big tables:
> >> > 1) Download schema from the source database instance
> >> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
> >> > schema
> >> > 3)Upload the schema to the destination DB.
> >> > 4) Configure Logical replication between source and destination DB.
> >> > 5) When last table logical replication table synchronization worker
> for
> >> > subscription "db_name_public_subscription", table "table_name" has
> >> > finished
> >> > 6) we created all the necessary PK, FK, and Indexes.
> >> > This approach allowed to us upload data more quickly. This approach
> was
> >> > working great on PostgreSQL 10.
> >> >
> >> > We tried the same approach for Postgresql13, but we got an error.
> >> > 1) Download schema from the source database instance
> >> > 2) Deleted PK, FK, and Indexes for tables bigger than 100Gb from the
> >> > schema
> >> > 3)Upload the schema to the destination DB.
> >> > 4) configurated identity replication full at source DB for tables
> bigger
> >> > than 100Gb
> >> > 5) Configured Logical replication between source and destination DB.
> >> > 6) During catchup on this big table process we got the following
> >> > messages:
> >> > Source DB
> >> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:LOG:
> terminating
> >> > walsender process due to replication timeout
> >> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:CONTEXT:
> slot
> >> > "db_name_public_subscription", output plugin "pgoutput", in the begin
> >> > callback, associated LSN 13705/2E913050
> >> > 2024-04-08 15:38:34 UTC:(27994):replication_role@:[22047]:STATEMENT:
> >> > START_REPLICATION SLOT "db_name_public_subscription" LOGICAL
> 13702/C2C8FB30
> >> > (proto_version '1', publication_names '"db_name_public_publication"')
> >> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:LOG:
> terminating
> >> > walsender process due to replication timeout
> >> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:CONTEXT:
> slot
> >> > "db_name_public_subscription_18989108_sync_17127", output plugin
> "pgoutput",
> >> > in the begin callback, associated LSN 13703/27942B70
> >> > 2024-04-08 15:38:34 UTC:(36862):replication_role@:[22811]:STATEMENT:
> >> > START_REPLICATION SLOT
> "db_name_public_subscription_18989108_sync_17127"
> >> > LOGICAL 13703/17622B58 (proto_version '1', publication_names
> >> > '"db_name_public_publication"')
> >> >
> >> > One important point. If there is no request to source DB logical
> replication
> >> > works fine for big tables.
> >> > I saw the messages in PostgreSQL bugs like
> >> >
> https://www.postgresql.org/message-id/flat/718213.1601410160%40sss.pgh.pa.us#7e61dd07661901b505bcbd74ce5f5f28
> >> > But I also did some tests and increased wal_keep_size
> >> > and max_slot_wal_keep_size to 1GB. And I set wal_sender_timeout to 1h
> but
> >> > without success. The setup works in PG 13 only with a small amount of
> data.
> >>
> >> I went through the issue and I think that the given logs are appearing
> >> due to some delay in the Apply Worker process.
> >> I could reproduce it in my development environment by applying delays
> >> in Apply Worker.
> >>
> >> I think this issue can be resolved by setting 'wal_sender_timeout' to
> >> a greater value. Can you try setting 'wal_sender_timeout' to a greater
> >> value?
> >>
> >> Also, I noticed that while using Postgresql13 you are configuring the
> >> table in sourceDB as REPLICA IDENTITY FULL but not doing the same in
> >> Postgresql10. Is there any specific reason for it?
> >> I pointed it out because REPLICA IDENTITY FULL has a different
> >> behaviour and sets the entire row as key.
>
> Few suggestions:
> 1. Can you try increasing wal_sender_timeout and wal_receiver_timeout
> to more than 1hrs? As in [1] you mentioned you were getting the same
> issue for wal_sender_timeout set to 1hr.
> 2. Do you find any other unusual logs in SourceDB or any errors logs
> in DestinationDB, If so please share them ?
> 3. Is it possible to share a reproducible testcase for the issue?
> 4. I observe that if we take a lock on table in DestinationDB which is
> subscribed to publication in SourceDB, we may end up getting the same
> logs as you mentioned in [1]. Can you check if anything similar is
> happening?
> 5. Can you clarify the point 'One important point. If there is no
> request to source DB logical replication works fine for big tables.'?
>
> [1]:
> https://www.postgresql.org/message-id/18433-538407560ac0c70b%40postgresql.org
>
> Thanks and Regards
> Shlok Kyal
>
