Streaming Replication Lag and pg_last_xlog_receive_location() question

From: Don Seiler <don(at)seiler(dot)us>
To: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Streaming Replication Lag and pg_last_xlog_receive_location() question
Date: 2018-07-23 16:07:51
Message-ID: CAHJZqBC4Hd3jSDh0HX_i89gpNM0miCLrXHeKxBBuy6sf9Mm8kQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Good morning listers,

The past couple of weeks I've observed lag to my remote DR standby for no
apparent reason. The local HA standby is caught up just fine, and there is
no extraordinary write activity on the primary this morning.

Running this query from the primary:

select client_addr as client, usename as user, application_name as appname,
state, sync_state as mode,
(pg_xlog_location_diff(pg_current_xlog_location(), sent_location) /
1024)::int as pending,
(pg_xlog_location_diff(sent_location, write_location) / 1024)::int as
write,
(pg_xlog_location_diff(write_location, flush_location) / 1024)::int as
flush,
(pg_xlog_location_diff(flush_location, replay_location) / 1024)::int as
replay,
(pg_xlog_location_diff(pg_current_xlog_location(), replay_location) /
1024)::int as total_lag
from pg_stat_replication;
client | user | appname | state | mode | pending |
write | flush | replay | total_lag
---------------+------------+-------------+-----------+-------+---------+-------+-------+--------+-----------
10.0.100.99 | replicator | walreceiver | streaming | async | 0 |
0 | 0 | 24 | 24
10.1.100.99 | replicator | walreceiver | streaming | async | 616433 |
768 | 0 | 128 | 617329

The 10.1 network is the DR site. You can see the "pending" column building
up, representing WAL data that hasn't been sent over the network yet. Given
that the local replica is doing just fine, my assumption is that it is
something wrong with the networking between data centers. My network admin
tells me everything is fine and that the circuit is nowhere near capacity.
It definitely doesn't seem to be a throughput issue in this case given that
there is no extraordinary load on the primary. The lag will wax and wane
and be gone for a few hours and then return. Last time it did this for 2-3
days and then went away altogether until this weekend again. This
"unexplained lag" just started last week. We've previously had expected lag
when running large DELETE operations, but the DB catches up from those and
then we'll see this unexpected lag hours or days later.

None of our other streaming replica setups on other DB clusters are having
problems, but they are even much more quiet than this one.

I'm curious what tools I might be missing from my brain to properly
diagnose this as either a network issue or something else and focus my
effort appropriately. Nothing changed in the DB or network stack
configurations that I'm aware of in the past two weeks.

One other thing: when I ran location check on the lagging standby, I
observed an interesting phenomenon. The value of
pg_last_xlog_receive_location() always ends in "6000" when it is lagging
like this. When things are fine it is any seemingly random set of
characters (confirmed on the local HA replica).

now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:03.64908-05 | 1679/6F326000 | 1679/6F325FE8

now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:07.331111-05 | 1679/6F3A6000 | 1679/6F3A5FF0

now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:11.706098-05 | 1679/6F446000 | 1679/6F444180

now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:15.562082-05 | 1679/6F4C6000 | 1679/6F4C5FC0

now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:18.394078-05 | 1679/6F546000 | 1679/6F545FF8

now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:50:28.328077-05 | 1679/6FF06000 | 1679/6FF04680

It was also like this last week when we saw similar "mystery" lag, receive
location ending in "6000" every time. Just curious about that one.

Thanks,
Don.

--
Don Seiler
www.seiler.us

Browse pgsql-admin by date

  From Date Subject
Next Message Sohel Tamboli 2018-07-23 17:14:44 Upgrade postgresql from 9.1 to 9.6
Previous Message Alfonso Moscato (Merqurio) 2018-07-21 16:11:18 Re: repeated out of shared memory error - not related to max_locks_per_transaction