Re: Postgres Logical Replication - how to see what subscriber is doing with received data?

From: Muhammad Ikram <mmikram(at)gmail(dot)com>
To: Shaheed Haque <shaheedhaque(at)gmail(dot)com>
Cc: Michael Jaskiewicz <mjaskiewicz(at)ghx(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Postgres Logical Replication - how to see what subscriber is doing with received data?
Date: 2024-09-02 08:45:16
Message-ID: CAGeimVrgXKejvzGB+Dks9ieZtnYYOF6RHLC5mDBxPCb-KLZLVw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Shaheed,
I think you must have already analyzed the outcome of queries
on pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc. I
could find a query SELECT
pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>',
'<subscriber_replayed_lsn>'));

As a side note if you want to see what has been applied to subscribers vs
what exists on publisher then here is something from my previous
experience. We used to have a Data Validation tool for checking tables/rows
across publisher/subscriber. We also used pg_dump for another tool that was
meant for making copies of schemas.

Regards,
Muhammad Ikram

On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedhaque(at)gmail(dot)com>
wrote:

> Hi Muhammad,
>
> On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmikram(at)gmail(dot)com> wrote:
>
>> Hi Shaheed,
>>
>> Maybe these considerations could help you or give any hint to the problem
>> ?
>>
>>
>> Check if wal_receiver_timeout being set to 0 could potentially cause
>> issues, like not detecting network issues quickly enough. Consider
>> re-evaluating this setting if you see connection issues.
>>
>> If you notice that some data is missing on subscriber then could you
>> increase max_slot_wal_keep_size on publisher so that WALs are not deleted
>> until they are applied on subscriber.
>>
>> Do you have flexibility to increase max_worker_processes and
>> max_logical_replication_workers, work_mem and maintenance_work_mem on
>> subscriber (In case bottleneck exists on subscriber)
>>
>> If there's significant lag, consider whether it might be more efficient
>> to drop the subscription and re-initialize it from scratch using a new base
>> backup, depending on the data volume and how long it might take for the
>> existing replication to catch up.
>>
>
> Thanks for the kind hints, I'll certainly look into those.
>
> My main interest however was with the "visibility" question, i.e. to get
> an understanding of the gap between the two ends of a replication slot,
> ideally in human terms (e.g. tables x records).
>
> I understand the difficulties of trying to produce a meaningful metric
> that spans two (or more) systems but let's be honest, trying to diagnose
> which knobs to tweak (whether in application, PG, the OS or the network) is
> basically black magic when all we really have is a pair of opaque LSNs.
>
>
>
>
>>
>> Regards,
>> Muhammad Ikram
>>
>>
>> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque(at)gmail(dot)com>
>> wrote:
>>
>>> Since nobody more knowledgeable has replied...
>>>
>>> I'm very interested in this area and still surprised that there is no
>>> official/convenient/standard way to approach this (see
>>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
>>> ).
>>>
>>> Based partly on that thread, I ended up with a script that connects to
>>> both ends of the replication, and basically loops while comparing the
>>> counts in each table.
>>>
>>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz(at)ghx(dot)com>
>>> wrote:
>>>
>>>> I've got two Postgres 13 databases on AWS RDS.
>>>>
>>>> - One is a master, the other a slave using logical replication.
>>>> - Replication has fallen behind by about 350Gb.
>>>> - The slave was maxed out in terms of CPU for the past four days
>>>> because of some jobs that were ongoing so I'm not sure what logical
>>>> replication was able to replicate during that time.
>>>> - I killed those jobs and now CPU on the master and slave are both
>>>> low.
>>>> - I look at the subscriber via `select * from
>>>> pg_stat_subscription;` and see that latest_end_lsn is advancing albeit very
>>>> slowly.
>>>> - The publisher says write/flush/replay lags are all 13 minutes
>>>> behind but it's been like that for most of the day.
>>>> - I see no errors in the logs on either the publisher or subscriber
>>>> outside of some simple SQL errors that users have been making.
>>>> - CloudWatch reports low CPU utilization, low I/O, and low network.
>>>>
>>>>
>>>>
>>>> Is there anything I can do here? Previously I set wal_receiver_timeout
>>>> timeout to 0 because I had replication issues, and that helped things. I
>>>> wish I had *some* visibility here to get any kind of confidence that
>>>> it's going to pull through, but other than these lsn values and database
>>>> logs, I'm not sure what to check.
>>>>
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> mj
>>>>
>>>
>>
>> --
>> Muhammad Ikram
>>
>>

--
Muhammad Ikram

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Shaheed Haque 2024-09-02 13:27:24 Re: Postgres Logical Replication - how to see what subscriber is doing with received data?
Previous Message Shaheed Haque 2024-09-02 07:42:05 Re: Postgres Logical Replication - how to see what subscriber is doing with received data?