Small bug in replication lag tracking

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject:	Small bug in replication lag tracking
Date:	2017-06-23 05:45:26
Message-ID:	CAEepm=3tJX_0kSeDi8OYTMp8NogrqPxgP1+2uzsdePz9i0-V0Q@mail.gmail.com
Lists:	pgsql-hackers

Hi,

I discovered a thinko in the new replication lag interpolation code
that can cause a strange number to be reported occasionally.

The interpolation code is designed to report increasing lag when
replay gets stuck somewhere between two LSNs for which we have
timestamp samples. The bug is that after sitting idle and fully
replayed for a while and then encountering a new burst of WAL
activity, we interpolate between an ancient sample and the
not-yet-reached one for the new traffic, which is inappropriate. It's
hard to see obviously strange lag times, because they normally only
exist for a very short moment in between receiving the first and
second replies from the standby, and they often look reasonable even
if you do manage to catch one in pg_stat_replication. You can reproduce
the problem by pausing replay on the standby between two bursts of
WAL separated by a long period of idleness.
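
To make the failure mode concrete, here is a minimal sketch of the
interpolation in Python. It is a simplified model, not the walsender
code: the names interpolated_lag and Sample, the use of plain integers
for LSNs, and seconds as the time unit are all assumptions made for
illustration. The stale-sample guard (passing None for the previous
sample once the standby has caught up) is one plausible way to avoid
the problem and is not necessarily what the attached patch does.

```python
from typing import Optional, Tuple

# Hypothetical sample type: (lsn, time in seconds at which the primary
# recorded that LSN). Real LSNs and timestamps are richer than this.
Sample = Tuple[int, float]


def interpolated_lag(prev: Optional[Sample], nxt: Sample,
                     replayed_lsn: int, now: float) -> Optional[float]:
    """Estimate lag when replay sits between two timestamped samples.

    Returns None when there is no usable earlier sample, meaning
    "no lag figure to report" rather than a guessed one.
    """
    if prev is None:
        return None
    prev_lsn, prev_time = prev
    next_lsn, next_time = nxt
    if not (prev_lsn <= replayed_lsn < next_lsn):
        raise ValueError("replayed_lsn must lie between the two samples")
    # Linear interpolation of the time at which replayed_lsn was written.
    fraction = (replayed_lsn - prev_lsn) / (next_lsn - prev_lsn)
    estimated_write_time = prev_time + fraction * (next_time - prev_time)
    return now - estimated_write_time


# Old burst ended at t=0 and was fully replayed; the primary then sat
# idle for an hour before a new burst was sampled at t=3600.
ancient = (1000, 0.0)
fresh = (2000, 3600.0)

# Replay is paused halfway into the new burst; the reply arrives at t=3600.5.
buggy = interpolated_lag(ancient, fresh, replayed_lsn=1500, now=3600.5)

# Guard (assumption): once the standby has caught up, forget the old sample.
guarded = interpolated_lag(None, fresh, replayed_lsn=1500, now=3600.5)
```

In this model the unguarded call reports half of the idle hour as lag
even though the WAL in question is less than a second old, while the
guarded call reports nothing until a fresh pair of samples is available.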

Please find attached a patch to fix that, with comments to explain.

--
Thomas Munro
http://www.enterprisedb.com

Attachment	Content-Type	Size
fix-lag-interpolation-bug.patch	application/octet-stream	2.1 KB

Responses

Re: Small bug in replication lag tracking at 2017-06-23 07:18:41 from Simon Riggs
