Re: Need help debugging slow logical replication

From: Justin <zzzzz(dot)graf(at)gmail(dot)com>
To: "sunyucong(at)gmail(dot)com" <sunyucong(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Need help debugging slow logical replication
Date: 2023-02-08 01:32:35
Message-ID: CALL-XeMsnc2Nz9XdaG1z7Lor8oYyZ=4PAiN953rMagkeKep+xA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Feb 7, 2023 at 8:07 PM sunyucong(at)gmail(dot)com <sunyucong(at)gmail(dot)com>
wrote:

> Hi Justin:
>
> - i checked that I have 2 tables using replication identity FULL, but
> one table is empty and one table has only 1 row
> - 7 tables using index
> - overall I have ~100 tables in the publication: But I do find that
> some tables doesn't have PK , but using replica identity default, I'm
> abit confused since the doc mentioned that that means it is using PK,
> does that mean it is using FULL?
>
> Cheers.
>
> On Tue, Feb 7, 2023 at 4:20 PM Justin <zzzzz(dot)graf(at)gmail(dot)com> wrote:
> >
> >
> > On Tue, Feb 7, 2023 at 6:38 PM sunyucong(at)gmail(dot)com <sunyucong(at)gmail(dot)com>
> wrote:
> >>
> >> Hi there,
> >>
> >> I am using PG 14.14 on both primary and secondary DB on AWS, setup
> >> using a logical replication, I'm having trouble with huge replication
> >> lag.
> >>
> >> My setup is as follows:
> >>
> >> P1 - physical - P1-R
> >> | (logical)
> >> P2 - physical - P2-R
> >>
> >>
> >> The lag between P1 & P1-R , P2 & P2-R are both minimal , less than
> >> 10seconds , but I'm seeing several hours of lag between P1 & P2 .We
> >> currently have 1 pub/sub that covers about 100 tables.
> >>
> >> Here is the output from P1 showing there is a Lag of at least 80GB
> >> (keep growing)
> >>
> >> > SELECT now() AS CURRENT_TIME,
> >> slot_name,
> >> active,
> >> active_pid,confirmed_flush_lsn,
> >> pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
> >> confirmed_flush_lsn)) AS diff_size,
> >> pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS
> diff_bytes
> >> FROM pg_replication_slots
> >> WHERE slot_type = 'logical';
> >> -[ RECORD 1 ]-------+---------------------------
> >> current_time | 2023-02-07 23:26:00.733+00
> >> slot_name | upgrade
> >> active | t
> >> active_pid | 5180
> >> confirmed_flush_lsn | 26B09/8C08C610
> >> diff_size | 81 GB
> >> diff_bytes | 86573472240
> >>
> >> Here is what pg_stat_replication shows: note that the write_lag is
> >> very high: we previously had to set wal_sender_timeout to 0, otherwise
> >> the logical replication work keep exiting and fail.
> >>
> >> > select * from pg_stat_replication;
> >> -[ RECORD 1 ]----+------------------------------
> >> pid | xxx
> >> usesysid | xxx
> >> usename | dev
> >> application_name | upgrade_target
> >> client_addr | 10.xxx
> >> client_hostname |
> >> client_port | 27404
> >> backend_start | 2023-02-07 23:02:39.228572+00
> >> backend_xmin | 2909692747
> >> state | catchup
> >> sent_lsn | 26B09/8C08C610
> >> write_lsn | 26B09/840514C0
> >> flush_lsn | 26B09/840514C0
> >> replay_lsn | 26B09/840514C0
> >> write_lag | 00:07:03.60362
> >> flush_lag | 00:07:03.60362
> >> replay_lag | 00:07:03.60362
> >> sync_priority | 0
> >> sync_state | async
> >> reply_time | 2023-02-07 23:22:08.245066+00
> >>
> >> So, the problem seems to be like P2 takes a very long time to apply a
> >> chunk, Mow, my question is why? From AWS performance insight, it
> >> looks like there is only 1 process running on P2, that is 100% blocked
> >> by CPU. I don't really understand why it is being blocked by the CPU,
> >> and any help in further debugging is much appreciated.
> >>
> >> Cheers.
> >
> >
> > Hi Sir,
> >
> > What is the replica identity being used on the tables? Are any of the
> tables using REPLICA IDENTITY FULL ?
> >
> > How many tables are being replicated?
>

Setting REPLICA IDENTITY FULL causes FULL TABLE SCANS for UPDATES and
DELETES. REPLICA IDENTITY FULL should only be used on INSERT only or
tables that see very few updates and deletes. REPLICA IDENTITY FULL will
cause replaying of the WAL to be very slow if there are many
updates/deletes for those tables.

REPLICA IDENTITY DEFAULT will only use primary keys, if the publisher
includes those tables, the subscriber when replaying the WAL will stop
throwing an error not knowing how to replay the UPDATE/DELETE.

Logical Replication is most likely broken at this point.

I suggest stopping logical replication and correcting tables that don't
have qualifying indexes for logical replication by creating the
necessary indexes and avoid using replica identity full. Then restart
logical replication from the beginning.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message sunyucong@gmail.com 2023-02-08 02:10:54 Re: Need help debugging slow logical replication
Previous Message sunyucong@gmail.com 2023-02-08 01:07:25 Re: Need help debugging slow logical replication