Monitoring logical replication

From: Shaheed Haque <shaheedhaque(at)gmail(dot)com>
To: pgsql-general list <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Monitoring logical replication
Date: 2023-10-07 15:31:46
Message-ID: CAHAc2jdAHvp7tFZBP37awcth=T3h5WXCN9KjZOvuTNJaAAC_hg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I've been playing with logical replication (currently on PG14),
specifically in an AWS RDS Postgres context, but NOT using AWS' own
replication tooling. I'm generally familiar with the challenges of
distributed systems (such causality, time synchronisation etc), but not
especially familiar with PG.

In looking at how to tell how a given subscriber has caught up with its
publisher, there is plenty of advice around the Web, for example
https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn.
Like this example, much advice ends up talking about using separate queries
on the publisher and the subscriber to compare LSNs. First, (I think) I
understand the core difficulty that comparing LSNs is inherently racy, but
given that, I'm a bit unclear as to why a single query on the publisher is
not enough...IIUC:

- Changes sent from the publisher to the subscriber are identified by
LSN.
- The publisher knows it's own current latest LSN (pg_current_wal_lsn()),
but this seems not to be exposed at the subscriber.
- The subscriber knows what it has applied locally and even tells the
publisher (pg_stat_subscription.latest_end_lsn), but it does not seem to
be exposed at the publisher.

Have I missed something? Is there a way to track the LSN delta (given that
this is known to be racy) just by querying one end?

Second, how do folk "know" when replication is "done". For example, if the
two LSNs continued to match for 1 * replication lag? Or N * replication
lag? What would be a plausible N?

Third, as we know when logical replication is started, the initial table
state is captured in a snapshot, and sent across using COPY TABLE under the
covers. Now, let's say that the publisher is idle (i.e. no SQL writes to
the user's schema...obviously pg_catalog might change as replication is
configured and enabled) and that the replication starts with the publisher
as LSN_start. How could one know when the copying is done:

- I initially assumed that the publisher's LSN would not change from
LSN_start, but as the copying proceeds, I see that it DOES change
(presumably because there are updates happening to pg_catalog, such as the
temporary slots coming and going).
- Is there some kind of singleton state on either publisher or
subscriber that could be checked to know? (At the moment, I am counting the
records in all copied tables).

I realise that the knowledge that the publisher is "idle" is a special
case, but right now, my test for being "done" is:

- Number of records in copied tables matches AND the publisher's
pg_stat_subscription matches the subscriber's pg_stat_subscription.
latest_end_lsn.

Plus or minus the bit about replication lag, is there a better way?

Thanks, Shaheed

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Steve Baldwin 2023-10-07 22:44:13 psql trying twice to connect to local DB
Previous Message Bruce Momjian 2023-10-06 21:11:57 Re: Gradual migration from integer to bigint?