Re: Need help debugging slow logical replication

From: Justin <zzzzz(dot)graf(at)gmail(dot)com>
To: "sunyucong(at)gmail(dot)com" <sunyucong(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Need help debugging slow logical replication
Date: 2023-02-08 00:19:55
Message-ID: CALL-XeN=CsGpnyOFMLbxnpm06VPwU-S3YyO7UNONcP-qsCAahg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Feb 7, 2023 at 6:38 PM sunyucong(at)gmail(dot)com <sunyucong(at)gmail(dot)com>
wrote:

> Hi there,
>
> I am using PG 14.14 on both primary and secondary DB on AWS, setup
> using a logical replication, I'm having trouble with huge replication
> lag.
>
> My setup is as follows:
>
> P1 - physical - P1-R
> | (logical)
> P2 - physical - P2-R
>
>
> The lag between P1 & P1-R , P2 & P2-R are both minimal , less than
> 10seconds , but I'm seeing several hours of lag between P1 & P2 .We
> currently have 1 pub/sub that covers about 100 tables.
>
> Here is the output from P1 showing there is a Lag of at least 80GB
> (keep growing)
>
> > SELECT now() AS CURRENT_TIME,
> slot_name,
> active,
> active_pid,confirmed_flush_lsn,
> pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
> confirmed_flush_lsn)) AS diff_size,
> pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS
> diff_bytes
> FROM pg_replication_slots
> WHERE slot_type = 'logical';
> -[ RECORD 1 ]-------+---------------------------
> current_time | 2023-02-07 23:26:00.733+00
> slot_name | upgrade
> active | t
> active_pid | 5180
> confirmed_flush_lsn | 26B09/8C08C610
> diff_size | 81 GB
> diff_bytes | 86573472240
>
> Here is what pg_stat_replication shows: note that the write_lag is
> very high: we previously had to set wal_sender_timeout to 0, otherwise
> the logical replication work keep exiting and fail.
>
> > select * from pg_stat_replication;
> -[ RECORD 1 ]----+------------------------------
> pid | xxx
> usesysid | xxx
> usename | dev
> application_name | upgrade_target
> client_addr | 10.xxx
> client_hostname |
> client_port | 27404
> backend_start | 2023-02-07 23:02:39.228572+00
> backend_xmin | 2909692747
> state | catchup
> sent_lsn | 26B09/8C08C610
> write_lsn | 26B09/840514C0
> flush_lsn | 26B09/840514C0
> replay_lsn | 26B09/840514C0
> write_lag | 00:07:03.60362
> flush_lag | 00:07:03.60362
> replay_lag | 00:07:03.60362
> sync_priority | 0
> sync_state | async
> reply_time | 2023-02-07 23:22:08.245066+00
>
> So, the problem seems to be like P2 takes a very long time to apply a
> chunk, Mow, my question is why? From AWS performance insight, it
> looks like there is only 1 process running on P2, that is 100% blocked
> by CPU. I don't really understand why it is being blocked by the CPU,
> and any help in further debugging is much appreciated.
>
> Cheers.
>

Hi Sir,

What is the replica identity being used on the tables? Are any of the
tables using REPLICA IDENTITY FULL ?

How many tables are being replicated?

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Brad White 2023-02-08 00:19:56 Fwd: Quoting issue from ODBC
Previous Message Adrian Klaver 2023-02-08 00:19:08 Re: Quoting issue from ODBC