Replication lag due to lagging restart_lsn

From: Satyam Shekhar <satyamshekhar(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Replication lag due to lagging restart_lsn
Date: 2020-08-18 16:27:34
Message-ID: CAAy_rtEP_CroVy4Gvcu3HmHxzRTKtYLC2JwNWSdsOPAsvMEyBQ@mail.gmail.com
Lists: pgsql-performance

Hello,

I wish to use logical replication in Postgres to capture transactions as
CDC and forward them to a custom sink.

To understand the overhead of the logical replication workflow, I created a toy
subscriber using pgjdbc's V3PGReplicationStream that acknowledges LSNs after
every 16k reads by calling setAppliedLSN, setFlushedLSN, and forceUpdateStatus.
The toy subscriber subscribes to a master Postgres instance
that publishes changes using a Publication. I then run a write-heavy
workload on this setup that generates transaction logs at approximately
235 MB/s. Postgres runs on a beefy machine with a 10+GBps network link
between Postgres and the toy subscriber.
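
To make the acknowledgement policy concrete, here is a minimal sketch of the
batching logic in Python. It is an illustration only, not the actual pgjdbc
subscriber: AckBatcher and send_ack are hypothetical names, LSNs are modelled
as plain integers, and "16k" is assumed to mean 16,384 messages.

```python
from typing import Callable, Optional


class AckBatcher:
    """Acknowledge the most recently received LSN once per batch of messages.

    Hypothetical helper: send_ack stands in for whatever call reports the
    applied/flushed position back to the server.
    """

    def __init__(self, send_ack: Callable[[int], None], batch_size: int = 16_384) -> None:
        self._send_ack = send_ack
        self._batch_size = batch_size  # assumption: "16k" means 16,384
        self._pending = 0              # messages received since the last ack
        self._last_lsn: Optional[int] = None

    def on_message(self, lsn: int) -> None:
        """Record one received message and ack if the batch is full."""
        self._last_lsn = lsn
        self._pending += 1
        if self._pending >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Ack the latest LSN if anything is still unacknowledged."""
        if self._last_lsn is not None and self._pending > 0:
            self._send_ack(self._last_lsn)
            self._pending = 0
```

With this policy the server only learns about progress once per batch, so the
confirmed position can trail the received position by up to one batch of
messages.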

My expectation with this setup was that the replication lag on master would
be minimal, as the subscriber acks the LSN almost immediately. However, I
observe the replication lag increasing continuously for the duration of
the test. Statistics in pg_replication_slots show that restart_lsn lags
significantly behind confirmed_flush_lsn. A cursory reading on restart_lsn
suggests that an increasing gap between restart_lsn and confirmed_flush_lsn
means that Postgres needs to reclaim disk space and advance restart_lsn to
catch up to confirmed_flush_lsn.
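
For reference, the gap between the two positions can be expressed in bytes.
The sketch below assumes the LSNs are taken as text from
pg_replication_slots (restart_lsn and confirmed_flush_lsn) in the usual
"hi/lo" hexadecimal form; parse_lsn and lsn_gap_bytes are hypothetical
helper names. On the server side, pg_wal_lsn_diff should give the same
number.

```python
def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a 64-bit byte position.

    The part before the slash is the high 32 bits and the part after it is
    the low 32 bits, both written in hexadecimal.
    """
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap_bytes(restart_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL between restart_lsn and confirmed_flush_lsn."""
    return parse_lsn(confirmed_flush_lsn) - parse_lsn(restart_lsn)
```

Sampling this value periodically during the test shows whether the gap is
growing steadily or only in bursts.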

With that context, I am looking for answers to two questions:

1. What work needs to happen in the database to advance restart_lsn to
confirmed_flush_lsn?
2. What is the recommendation for tuning the database to reduce the
replication lag in such scenarios?

Regards,
Satyam
