Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
Date: 2021-12-23 17:07:48
Message-ID: CAHg+QDdo6GtxvFZ2SovU_E_3Nqhh15JHrvxiEt0HeVeX0533Mw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 23, 2021 at 5:18 AM Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
wrote:

> On Thu, Dec 23, 2021 at 5:53 AM SATYANARAYANA NARLAPURAM
> <satyanarlapuram(at)gmail(dot)com> wrote:
> >
> > Hi Hackers,
> >
> > I am considering implementing an RPO (recovery point objective) enforcement
> feature for Postgres where WAL writes on the primary are stalled when
> the WAL distance between the primary and standby exceeds the configured
> (replica_lag_in_bytes) threshold. This feature is particularly useful in
> disaster recovery setups where the primary and standby are in different
> regions and synchronous replication can't be set up for latency and
> performance reasons, yet some level of RPO enforcement is required.
>
> Limiting transaction rate when the standby falls behind is a good feature
> ...
>
> >
> > The idea here is to calculate the lag between the primary and the
> standby (Async?) server during XLogInsert and block the caller until the
> lag is less than the threshold value. We can calculate the max lag by
> iterating over ReplicationSlotCtl->replication_slots. If this is not
> something we want to do in core, at least adding a hook for
> XLogInsert would be of great value.
>
> but doing it in XLogInsert does not seem to be a good idea.

Isn't XLogInsert the best place to throttle/govern in a simple and fair
way, particularly for long-running transactions on the server?
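
To make the proposal quoted above concrete, here is a rough, standalone
sketch of the lag check. It is not PostgreSQL code: WalPos, SlotInfo,
max_replica_lag and should_throttle are hypothetical names, the slot
struct is a simplified stand-in for what a real patch would read from
ReplicationSlotCtl->replication_slots under the appropriate locks, and
treating a threshold of 0 as "disabled" is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 64-bit WAL byte position (hypothetical type name). */
typedef uint64_t WalPos;

/* Hypothetical, simplified view of one replication slot. */
typedef struct SlotInfo
{
    bool   in_use;         /* slot is allocated */
    WalPos confirmed_pos;  /* WAL position the standby has confirmed */
} SlotInfo;

/* Largest distance in bytes between insert_pos and any in-use slot. */
static uint64_t
max_replica_lag(const SlotInfo *slots, size_t nslots, WalPos insert_pos)
{
    uint64_t max_lag = 0;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        uint64_t lag;

        if (!slots[i].in_use)
            continue;
        /* A standby at or ahead of insert_pos contributes no lag. */
        if (slots[i].confirmed_pos >= insert_pos)
            continue;

        lag = insert_pos - slots[i].confirmed_pos;
        if (lag > max_lag)
            max_lag = lag;
    }
    return max_lag;
}

/*
 * True when the caller should wait before inserting more WAL.
 * Assumption: replica_lag_in_bytes == 0 means throttling is disabled.
 */
static bool
should_throttle(const SlotInfo *slots, size_t nslots,
                WalPos insert_pos, uint64_t replica_lag_in_bytes)
{
    if (replica_lag_in_bytes == 0)
        return false;
    return max_replica_lag(slots, nslots, insert_pos) > replica_lag_in_bytes;
}
```

A caller would re-evaluate should_throttle in a wait loop and proceed
once it returns false; how to sleep and wake up efficiently is left out
of this sketch.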

> It's a
> common point for all kinds of logging including VACUUM. We could
> accidentally stall a critical VACUUM operation because of that.
>

Agreed, but again this is a policy decision that the DBA can relax or
enforce. I expect the RPO to be in the range of a few hundred MB to a few
GB, and on a healthy system the lag typically never comes close to this
value. The hook implementation can take care of the nitty-gritty details of
policy enforcement based on the needs: for example, not throttling some
backend processes such as vacuum and the checkpointer; throttling based on
roles, for example not throttling superuser connections; and throttling
based on replay lag, write lag, checkpoints taking longer, or the disk
getting close to full. Each of these can be easily translated into GUCs.
Depending on whether the thread favors a hook or a feature in core, I can
add more implementation details.

> As Bharath described, it had better be handled by application-level
> monitoring.
>

RPO-based WAL throttling and application-level monitoring can coexist, as
each has its own merits and challenges. With application-level monitoring
alone, each application developer has to implement their own throttling
logic, and it is often hard to get right.

> --
> Best Wishes,
> Ashutosh Bapat
>
