From: | Don Seiler <don(at)seiler(dot)us> |
---|---|
To: | pgsql-admin <pgsql-admin(at)postgresql(dot)org> |
Subject: | Streaming Replication Lag and pg_last_xlog_receive_location() question |
Date: | 2018-07-23 16:07:51 |
Message-ID: | CAHJZqBC4Hd3jSDh0HX_i89gpNM0miCLrXHeKxBBuy6sf9Mm8kQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
Good morning listers,
The past couple of weeks I've observed lag to my remote DR standby for no
apparent reason. The local HA standby is caught up just fine, and there is
no extraordinary write activity on the primary this morning.
Running this query from the primary:
select client_addr as client, usename as user, application_name as appname,
state, sync_state as mode,
(pg_xlog_location_diff(pg_current_xlog_location(), sent_location) /
1024)::int as pending,
(pg_xlog_location_diff(sent_location, write_location) / 1024)::int as
write,
(pg_xlog_location_diff(write_location, flush_location) / 1024)::int as
flush,
(pg_xlog_location_diff(flush_location, replay_location) / 1024)::int as
replay,
(pg_xlog_location_diff(pg_current_xlog_location(), replay_location) /
1024)::int as total_lag
from pg_stat_replication;
client | user | appname | state | mode | pending |
write | flush | replay | total_lag
---------------+------------+-------------+-----------+-------+---------+-------+-------+--------+-----------
10.0.100.99 | replicator | walreceiver | streaming | async | 0 |
0 | 0 | 24 | 24
10.1.100.99 | replicator | walreceiver | streaming | async | 616433 |
768 | 0 | 128 | 617329
The 10.1 network is the DR site. You can see the "pending" column building
up, representing WAL data that hasn't been sent over the network yet. Given
that the local replica is doing just fine, my assumption is that it is
something wrong with the networking between data centers. My network admin
tells me everything is fine and that the circuit is nowhere near capacity.
It definitely doesn't seem to be a throughput issue in this case given that
there is no extraordinary load on the primary. The lag will wax and wane
and be gone for a few hours and then return. Last time it did this for 2-3
days and then went away altogether until this weekend again. This
"unexplained lag" just started last week. We've previously had expected lag
when running large DELETE operations, but the DB catches up from those and
then we'll see this unexpected lag hours or days later.
None of our other streaming replica setups on other DB clusters are having
problems, but they are even much more quiet than this one.
I'm curious what tools I might be missing from my brain to properly
diagnose this as either a network issue or something else and focus my
effort appropriately. Nothing changed in the DB or network stack
configurations that I'm aware of in the past two weeks.
One other thing: when I ran location check on the lagging standby, I
observed an interesting phenomenon. The value of
pg_last_xlog_receive_location() always ends in "6000" when it is lagging
like this. When things are fine it is any seemingly random set of
characters (confirmed on the local HA replica).
now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:03.64908-05 | 1679/6F326000 | 1679/6F325FE8
now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:07.331111-05 | 1679/6F3A6000 | 1679/6F3A5FF0
now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:11.706098-05 | 1679/6F446000 | 1679/6F444180
now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:15.562082-05 | 1679/6F4C6000 | 1679/6F4C5FC0
now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:49:18.394078-05 | 1679/6F546000 | 1679/6F545FF8
now | receive_location | replay_location
-------------------------------+------------------+-----------------
2018-07-23 10:50:28.328077-05 | 1679/6FF06000 | 1679/6FF04680
It was also like this last week when we saw similar "mystery" lag, receive
location ending in "6000" every time. Just curious about that one.
Thanks,
Don.
--
Don Seiler
www.seiler.us
From | Date | Subject | |
---|---|---|---|
Next Message | Sohel Tamboli | 2018-07-23 17:14:44 | Upgrade postgresql from 9.1 to 9.6 |
Previous Message | Alfonso Moscato (Merqurio) | 2018-07-21 16:11:18 | Re: repeated out of shared memory error - not related to max_locks_per_transaction |