Re: Streaming replication question

From: Keith <keith(at)keithf4(dot)com>
To: Mark Steben <mark(dot)steben(at)drivedominion(dot)com>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Streaming replication question
Date: 2019-08-03 15:22:31
Message-ID: CAHw75vu0MXEzdu04grea1=ow2hgMBTmsyN=BLC0Y0-U5jQWWMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

On Sat, Aug 3, 2019 at 9:36 AM Mark Steben <mark(dot)steben(at)drivedominion(dot)com>
wrote:

> Good morning,
> We run postgres 9.4. Early Saturday morning we had a production postgres
> outage because our pg_xlog directory ran out of space. Tracing the cause
> points to a scheduled reboot of the 3 database servers by our IT team to
> install some monitoring software. Because we did a simple stop/restart of
> the database our replication slots did not get reset and pg_logs queued up
> until the out-of-space condition
>
> Can someone please verify that the following WOULD HAVE BEEN the
> appropriate action:
> (Or offer corrections)
> 1. Stop the postgres database (on both slaves)
> 2. Run pg_drop_replication_slot (rep1, rep2 slotnames) (on Master)
> 3. Stop the postgres database (on master)
> 4. Reboot Linux Server to bring in monitoring software
> 5. Start the postgres database (on master)
> 6. Run pg_create_replication_slot (rep1, rep2 slotnames) (On Master)
> 7. Start the postgres database (on both slaves)
>
> Thank you
>
> --
> *Mark Steben*
> Database Administrator
> @utoRevenue <http://www.autorevenue.com/> | Autobase
> <http://www.autobase.net/>
> CRM division of Dominion Dealer Solutions
> 95D Ashley Ave.
> West Springfield, MA 01089
> t: 413.327-3045
> f: 413.383-9567
>
> www.fb.com/DominionDealerSolutions
> www.twitter.com/DominionDealer
> www.drivedominion.com <http://www.autorevenue.com/>
>
> <http://autobasedigital.net/marketing/DD12_sig.jpg>
>
>
>
Restarting your databases should not have affected the replication slots
like this. If you stopped the replicas first, then the primary would have
just started keeping all the WAL files until the replicas came back,
resuming from where they left off. If you stopped the primary first then
the replicas would have just lost their connection until the primary came
back and made its slots available again.

The only time the slots don't stick around is if you do a failover from the
primary to one of the replicas. In that case, yes, you do have to recreate
the slots. When things were rebooted, was there any failovers kicked off?

I'd check your postgres logs to see if there's any hint as to why things
turned out the way they did. If all systems were rebooted at the exact same
time, there may be some edge-case bug that's not being accounted for. But
without a deeper dive into what happened, that would be hard to say and
seems unlikely. I would say to make sure you are on the most recent version
of 9.4 to ensure any known bug fixes are in place. Also highly recommend on
planning to upgrade to a newer major version. 9.4 goes out of support this
fall upon the release of version 12.

Keith

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Jānis Pūris 2019-08-04 00:27:39 HAProxy + Patroni + pgBouncer High Availability setup
Previous Message Mark Steben 2019-08-03 13:35:55 Streaming replication question