Re: Metric to calculate WAL size left to transfer to Standby

From: Viral Shah <vshah(at)nodalexchange(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Metric to calculate WAL size left to transfer to Standby
Date: 2021-05-07 20:07:32
Message-ID: CAKL_ojwYZDnCPh7P82OX5pdY6EnpLYFbFGZ2NOm+pra-kueLHA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Laurenz,

The above metric works fine for my primary server. However, We have a
cascading setup in our production system. This particular query doesn't
work for my intermediate server which is standby to the primary server but
also a master to one more standby server. We get the following error:

prod=# SELECT round(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
/ (1024 :: NUMERIC ^ 2), 2 ) AS replay_lag FROM pg_stat_replication
where application_name like 'rtv%';
ERROR: recovery is in progress
HINT: WAL control functions cannot be executed during recovery.

I wanted to also track the transfer/replay lag between the intermediate
server and the final standby. I narrowed down the problem with
pg_current_wal_lsn() procedure. Any idea on how to mitigate this? or any
alternative to the pg_current_wal_lsn() procedure that I can use to get
transfer lag?

Thanks,
Viral Shah

On Fri, Apr 16, 2021 at 7:52 PM Viral Shah <vshah(at)nodalexchange(dot)com> wrote:

> Hello Laurenz,
>
> Thank you so much for sending the query. It was exactly what I needed. I
> just made 1 modification to beautify the transfer and replay lag and I can
> see the size in bytes.
>
> SELECT application_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS transfer_lag,
> pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
> FROM pg_stat_replication;
>
> I am now using zabbix to constantly monitor them and notify myself if it breaches a certain threshold.
>
>
> Thanks again!
>
>
> Best,
>
> Viral Shah
>
> Nodal Exchange LLC
>
>
>
> On Thu, Apr 15, 2021 at 8:10 AM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
> wrote:
>
>> On Wed, 2021-04-14 at 17:50 -0400, Viral Shah wrote:
>> > We have a PostgreSQL 10.12 cluster of servers in two different data
>> centers.
>> > Off lately, in the case of a large WAL generation, we are seeing
>> replication
>> > delay between the master and the standby server. These delays have off
>> lately
>> > been there for an unusually long time. I was wondering if we have any
>> metric
>> > that can calculate the amount (size) of WAL transfer left between
>> master and
>> > standby?
>> >
>> > PS: We have ensured we have upgraded our firewalls for better speed
>> transfer.
>> >
>> > Any help on how to figure out the slowness in the WAL transfer would be
>> much appreciated.
>>
>> SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS transfer_lag,
>> pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
>> FROM pg_stat_replication;
>>
>> If both are delayed, it might be that the network cannot cope.
>>
>> If only the second number is delayed, you have replication conflicts
>> with queries on the standby.
>>
>> Yours,
>> Laurenz Albe
>> --
>> Cybertec | https://www.cybertec-postgresql.com
>>
>>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2021-05-08 02:23:08 Re: Have I found an interval arithmetic bug?
Previous Message Adrian Klaver 2021-05-07 14:57:37 Re: trigger impacting insertion of records