Re: Detect when streaming replication stops streaming.

From: Keith Fiske <keith(dot)fiske(at)crunchydata(dot)com>
To: wegge(at)wegge(dot)dk
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Detect when streaming replication stops streaming.
Date: 2018-10-04 14:31:27
Message-ID: CAODZiv5Ytn0XPLwS65_Ow9FbjZc=O+7-5Z4yeWxLAiA6Fau_Lw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

On Thu, Oct 4, 2018 at 6:13 AM Anders Wegge Keller <wegge(at)wegge(dot)dk> wrote:

> Is there a reliable way to detect, on the standby, that continuous recovery
> have stopped for any reason? I'm adding a postgres 9.2(1) database to an
> industrial control system that have to run unattended. Thus, I need to plan
> for some really exotic failure modes, like e.g. loss of network
> connectivity
> in an indefinitely long period.
>
> On the master system, I can check for the existence of the proper row in
> pg_stat_replication. If that is missing, I can report a degraded status on
> the whole system. However, on the standby side, I've been unable to find
> something similar. Especially the failure mode where the standby return
> from
> a long period of no network connectivity, where the master no longer have
> the WAL segments needed to catch up. Snooping for FATAL in
> pg_log/Postgres-xxx is error prone, and I'd rather not go the route of
> polling the master system for a situation that should be detectable on the
> standby instance.
>
>
> --
> //Wegge
>
> 1. Yes, I know it's no longer supported, but that's the way of redhat.
>
>
Checking replication status from just the replica itself isn't that
straight forward. This makes sense, though, since how is the replica
supposed to know that it's behind if it doesn't compare itself to something
else? PostgreSQL replication works by simply replaying the WAL stream from
another database. When that WAL stream stops, the replication stops. This
could be normal if the primary hasn't gotten any writes.

I wrote a brief blog post on monitoring replica lag:

https://www.keithf4.com/monitoring_streaming_slave_lag/

There is a query you can run on the replica, but as I said above and in the
blog post, this can cause false positives when there's no writes on the
primary. But if you're always supposed to be getting writes to the primary,
it's the easiest method to check, and could even help indicate when there's
a problem if your primary isn't getting writes as it should be. I highly
recommend also monitoring your primary for byte lag as well, so you can at
least know when the replica is throwing a false positive.

Another option would be, as you hinted, to watch the replica for expected
recent data. But again, if there's no writes, it can cause false positives.
So it's really not much different than just checking for the last WAL
replay.

If you're worried about the replica falling behind due to network outages
or something similar, I recommend looking into WAL shipping in addition to
streaming replication. If postgres doesn't get a successful run result from
the archive_command, it keeps the WAL file around until it does. You will
have to monitor for disk space usage issues on the primary if it backs up
too much. If this happens a lot, you may be better off shipping them to a
third-party server, preferably where you're storing your backups. You can
then have the replica fetch its WAL stream from there if streaming
replication fails. pgbackrest is one tool that can do this for you, as well
as help manage backups. https://pgbackrest.org/

If you're able to upgrade to 9.4, postgres did introduce replication slots
to help the primary actually be aware of the state of its replicas and keep
WAL files around automatically until all its replicas are caught up. While
9.2 is the default version of PG that comes with Redhat/CentOS, the
community does provide repositories with more recent rpm versions.
https://www.postgresql.org/download/linux/redhat/ I would highly recommend
looking into this since Redhat is not going to be updating their default
version for a very long time.

--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Anders Wegge Keller 2018-10-04 21:06:30 Re: Detect when streaming replication stops streaming.
Previous Message Joakim Lundgren 2018-10-04 11:52:50 Re: Takes long time to "fail back" using pg_rewind