From: | Mark Steben <mark(dot)steben(at)drivedominion(dot)com> |
---|---|
To: | Keith <keith(at)keithf4(dot)com> |
Cc: | pgsql-admin <pgsql-admin(at)postgresql(dot)org> |
Subject: | Re: Streaming replication question |
Date: | 2019-08-04 11:06:32 |
Message-ID: | CADyzmyyG6i9fBcBVPXKzRZLBp23Sb9OWGpYYdYQT1iPNU8dYxw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
Thank your for your prompt response Keith. You were correct. The root of
the problem was an ssh host key change which caused all logshipping to
error, and therefore pg_xlog queued up.
On Sat, Aug 3, 2019 at 11:23 AM Keith <keith(at)keithf4(dot)com> wrote:
>
>
> On Sat, Aug 3, 2019 at 9:36 AM Mark Steben <mark(dot)steben(at)drivedominion(dot)com>
> wrote:
>
>> Good morning,
>> We run postgres 9.4. Early Saturday morning we had a production postgres
>> outage because our pg_xlog directory ran out of space. Tracing the cause
>> points to a scheduled reboot of the 3 database servers by our IT team to
>> install some monitoring software. Because we did a simple stop/restart of
>> the database our replication slots did not get reset and pg_logs queued up
>> until the out-of-space condition
>>
>> Can someone please verify that the following WOULD HAVE BEEN the
>> appropriate action:
>> (Or offer corrections)
>> 1. Stop the postgres database (on both slaves)
>> 2. Run pg_drop_replication_slot (rep1, rep2 slotnames) (on Master)
>> 3. Stop the postgres database (on master)
>> 4. Reboot Linux Server to bring in monitoring software
>> 5. Start the postgres database (on master)
>> 6. Run pg_create_replication_slot (rep1, rep2 slotnames) (On Master)
>> 7. Start the postgres database (on both slaves)
>>
>> Thank you
>>
>> --
>> *Mark Steben*
>> Database Administrator
>> @utoRevenue <http://www.autorevenue.com/> | Autobase
>> <http://www.autobase.net/>
>> CRM division of Dominion Dealer Solutions
>> 95D Ashley Ave.
>> West Springfield, MA 01089
>> t: 413.327-3045
>> f: 413.383-9567
>>
>> www.fb.com/DominionDealerSolutions
>> www.twitter.com/DominionDealer
>> www.drivedominion.com <http://www.autorevenue.com/>
>>
>> <http://autobasedigital.net/marketing/DD12_sig.jpg>
>>
>>
>>
> Restarting your databases should not have affected the replication slots
> like this. If you stopped the replicas first, then the primary would have
> just started keeping all the WAL files until the replicas came back,
> resuming from where they left off. If you stopped the primary first then
> the replicas would have just lost their connection until the primary came
> back and made its slots available again.
>
> The only time the slots don't stick around is if you do a failover from
> the primary to one of the replicas. In that case, yes, you do have to
> recreate the slots. When things were rebooted, was there any failovers
> kicked off?
>
> I'd check your postgres logs to see if there's any hint as to why things
> turned out the way they did. If all systems were rebooted at the exact same
> time, there may be some edge-case bug that's not being accounted for. But
> without a deeper dive into what happened, that would be hard to say and
> seems unlikely. I would say to make sure you are on the most recent version
> of 9.4 to ensure any known bug fixes are in place. Also highly recommend on
> planning to upgrade to a newer major version. 9.4 goes out of support this
> fall upon the release of version 12.
>
> Keith
>
--
*Mark Steben*
Database Administrator
@utoRevenue <http://www.autorevenue.com/> | Autobase
<http://www.autobase.net/>
CRM division of Dominion Dealer Solutions
95D Ashley Ave.
West Springfield, MA 01089
t: 413.327-3045
f: 413.383-9567
www.fb.com/DominionDealerSolutions
www.twitter.com/DominionDealer
www.drivedominion.com <http://www.autorevenue.com/>
From | Date | Subject | |
---|---|---|---|
Next Message | Wells Oliver | 2019-08-05 16:32:23 | Diagnosing a massive toast file |
Previous Message | Jānis Pūris | 2019-08-04 00:27:39 | HAProxy + Patroni + pgBouncer High Availability setup |