Re: Critical failure of standby

From: James Sewell <james(dot)sewell(at)jirotech(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Critical failure of standby
Date: 2016-08-16 02:23:11
Message-ID: CAANVwEsmzXgSBq=h7krw-bXBwdVgY9xBHDqDFyYCar1o5159ZA@mail.gmail.com
Lists: pgsql-general

Those are all good questions.

Essentially this is a situation where DR is network separated from Prod,
so I would expect the archive command to fail. I'll have to check the
script; it must not be passing the error back through to PostgreSQL.
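
As a rough sketch of what I mean by passing the error back: PostgreSQL
only treats restore_command as successful when it exits with status zero,
so a wrapper has to return the exit status of the copy command unchanged.
The host name, archive path and script name below are hypothetical
placeholders, not our actual setup.

```python
import subprocess
import sys


def restore_wal(fetch_cmd):
    """Run the fetch command and hand its exit status back to the caller."""
    result = subprocess.run(fetch_cmd)
    # A negative returncode means the child was killed by a signal;
    # report that as a plain failure (1) instead of hiding it.
    return result.returncode if result.returncode >= 0 else 1


def main(argv):
    if len(argv) != 3:
        print("usage: restore_wal.py <wal_file_name> <destination_path>",
              file=sys.stderr)
        return 2
    wal_name, dest = argv[1], argv[2]
    # Hypothetical archive location; replace with the real one.
    src = f"archive.example.com:/wal_archive/{wal_name}"
    # BatchMode=yes makes scp fail instead of prompting (e.g. for host keys).
    return restore_wal(["scp", "-q", "-o", "BatchMode=yes", src, dest])
```

A real script would end with sys.exit(main(sys.argv)) and could be wired
up as restore_command = 'python3 /path/to/restore_wal.py %f %p', where the
path is again a placeholder. The point is only that a failed scp results
in a non-zero exit status that PostgreSQL can see and log.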

This still shouldn't cause database corruption though, right? It's just
not getting WALs.

Cheers,

James Sewell,
PostgreSQL Team Lead / Solutions Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
P (+61) 2 8099 9000  W www.jirotech.com  F (+61) 2 8099 9099

On Tue, Aug 16, 2016 at 2:09 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Thu, Aug 11, 2016 at 10:39 PM, James Sewell <james(dot)sewell(at)jirotech(dot)com>
> wrote:
>
>> Hello,
>>
>> We recently experienced a critical failure when failing over to a DR
>> environment.
>>
>> This is in the following environment:
>>
>>
>> - 3 x PostgreSQL machines in Prod in a sync replication cluster
>> - 3 x PostgreSQL machines in DR, with a single machine async and the
>> other two cascading from the first machine.
>>
>> There was a network failure which isolated Production from everything
>> else. Production had no errors during this time (and has now come back OK).
>>
>> DR did not tolerate the break. The following appeared in the logs, and
>> none of the DR machines can start postgres. There were no queries coming
>> into DR at the time of the break.
>>
>> Please note that the "Host Key verification failed" messages are due to
>> the scp command not functioning. This means restore_command is not working
>> to restore from the XLOG archive, but should not affect anything else.
>>
>
>
> In my experience, PostgreSQL issues its own error messages when
> restore_command fails. So I see both the error from the command itself,
> and an error from PostgreSQL. Why don't you see that? Is the
> restore_command failing, but then reporting that it succeeded?
>
> And if you can't get files from the XLOG archive, why do you think that
> that is OK?
>
> Cheers,
>
> Jeff
>

