Re: Monitoring logical replication

From: Shaheed Haque <shaheedhaque(at)gmail(dot)com>
To: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
Cc: pgsql-general list <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Monitoring logical replication
Date: 2024-09-03 13:01:50
Message-ID: CAHAc2jeWZ1y4S+ibkGJ8uh7QXyv8Z-O=AAO5FKz2ZX3ZysmM7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Mostly to close the loop on this, now that I have things going seemingly
reliably...

On Tue, 18 Jun 2024 at 14:33, Ron Johnson <ronljohnsonjr(at)gmail(dot)com> wrote:

> On Tue, Jun 18, 2024 at 5:03 AM Shaheed Haque <shaheedhaque(at)gmail(dot)com>
> wrote:
>
>> Hi all,
>>
>> Is there an "official" pairing of LSN values on the publication and
>> subscription sides that should be used to track the delta between the two
>> systems? I ask because Google is full of different pairs being used. I
>> tried to identify the highest level interface points exposed, i.e. what is
>> documented on
>> https://www.postgresql.org/docs/current/replication-origins.html, the
>> pg_stat_subscription table, the pg_stat_publication table and the
>> pg_current_wal_lsn() function on the publisher, but these seem to be barely
>> used.
>>
>
> The attached scripts (whose guts I took from a Stack Exchange post) might
> be a good starting point. It certainly works for physical replication!
>
>
>> P.S. On a related note, I see a (stalled?) discussion on providing LSN
>> -> timestamp conversion
>> <https://www.postgresql.org/message-id/flat/CAAKRu_bw7Pgw8Mi9LJrBkFvPPHgvVjPphrT8ugbzs-2V0f%2B1Rw%40mail.gmail.com#8540282228634ecd061585867c6275ca>,
>> I'd just like to say that something like that would be very useful.
>>
>
> Out of curiosity, how does that work? Is an instance's initial LSN really
> based on Epoch?
>

According to the docs at
https://www.postgresql.org/docs/current/datatype-pg-lsn.html, the LSN is "a
64-bit integer, representing a byte position in the write-ahead log
stream", so I guess some black magic is required to turn that into a
timestamp.

My use case might not be all that common, as I am NOT interested in a
long-term replica. What I am doing is making a copy of an "old" Django
deployment to a "new" deployment such that when the two ends are in close
sync, I can freeze traffic to the old deployment, pause for any final
catchup, and then run a Django migration on the new, before switching on
the new (thereby minimising the down time for the app). That being said:

- It turns out that one cannot use the LSN alone to check for sync
since, during the initial "full table copying" phase of the replication,
the LSNs at the two ends seem to be the same.
- Instead, I track three metrics on each end:
- On the publication end:
- "count(*)" for each table
- pg_current_wal_lsn()
- tuples_processed from pg_stat_progress_copy for each table
- On the subscription end:
- "count(*)" for each table
- latest_end_lsn from pg_stat_subscription
- tuples_processed from pg_stat_progress_copy for each table
- The sync is considered complete when all three metrics are aligned.
- I then freeze activity on the "old" deployment, wait for any in-flight
metric changes to come through, and then run the migration on the "new"
deployment.

As always perhaps that will be of use to somebody, but YMMV.

Thanks all,

Shaheed

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Dominique Devienne 2024-09-03 14:49:50 Re: How to grant role to other user
Previous Message Heikki Linnakangas 2024-09-03 10:04:43 Re: PG17 optimizations to vacuum