Re: Slave promotion problem...

From: marin(at)kset(dot)org
To: Martín Marqués <martin(at)2ndquadrant(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Slave promotion problem...
Date: 2015-08-31 14:05:26
Message-ID: 02c9f0b656add4fae12ec2453fdc6b84@kset.org
Lists: pgsql-general

On 2015-08-31 14:38, Martín Marqués wrote:
> On 31/08/15 at 03:29, marin(at)kset(dot)org wrote:
>> Last week we had some problems on the master server which caused a
>> failover on the slave (the master was completely unresponsive due to
>> reasons still unknown). The slave received the promote signal (pg_ctl
>> promote) and logged it:
>> 2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG: received promote request
>> 2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL: terminating walreceiver process due to administrator command
>>
>> Five hours later the slave still had not been promoted. Meanwhile we
>> fixed the master and restarted it. The slave was then restarted, and it
>> behaved as if the promote signal had never arrived, connecting to the
>> master as a regular slave.
>
> Aren't there any further logs after the walreceiver termination?
> Up to here everything looks fine, but we have no idea what was logged
> afterwards.
There are logs (quite a lot, approx. 5 hours' worth), with something like
this every second:
2015-08-28 23:05:12 UTC [79867]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=[local]
2015-08-28 23:05:12 UTC [79867]: [2-1] user=postgres,db=postgres LOG: connection authorized: user=postgres database=postgres
These are the connections made by the process that checks whether the
server is alive.
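That probe only confirms that the server accepts connections, and as the log above shows, a standby that is still in recovery passes it too. A minimal sketch of a stricter check, assuming the automated promote script can run a query after sending the promote signal: poll the recovery state against a deadline and report failure instead of assuming success. `in_recovery` is a hypothetical callable (for example, one that runs `SELECT pg_is_in_recovery()` through whatever driver the script already uses), and the timeout is whatever the failover procedure can tolerate.

```python
import time
from typing import Callable


def wait_for_promotion(
    in_recovery: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until in_recovery() returns False or the deadline passes.

    in_recovery is a hypothetical callable, e.g. a wrapper around
    SELECT pg_is_in_recovery(). Returns True once the server has left
    recovery (promotion finished), or False if it is still in recovery
    when timeout_s has elapsed. clock and sleep are injectable so the
    function can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if not in_recovery():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False result is the point where the script should raise an alert rather than report the failover as done.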

I was expecting to see something like:
redo done at xxxxx
last completed transaction was at log time xxxxxxx

But those lines still had not appeared after 5 hours. As I understand it,
they are written before the server uses the restore_command to fetch WAL
and history files from the archive.
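To catch this automatically, a monitoring job could scan the server log for the lines I expected. A rough sketch: it looks only at what follows the most recent "received promote request" line and returns the expected markers that have not shown up. The marker strings are copied from the messages quoted here; their exact wording may differ between PostgreSQL versions, so treat them as an assumption to check against your own logs.

```python
from typing import Iterable, List

# Marker strings taken from the log messages discussed in this thread;
# exact wording may vary between PostgreSQL versions (assumption).
PROMOTE_REQUEST = "received promote request"
EXPECTED_AFTER_PROMOTE = (
    "redo done at",
    "last completed transaction was at log time",
)


def missing_after_promote(lines: Iterable[str]) -> List[str]:
    """Return the expected markers not seen after the last promote request.

    Returns an empty list when no promote request was logged at all.
    """
    seen_request = False
    found = set()
    for line in lines:
        if PROMOTE_REQUEST in line:
            # Start over at every promote request, so only the most
            # recent promotion attempt is evaluated.
            seen_request = True
            found.clear()
        elif seen_request:
            for marker in EXPECTED_AFTER_PROMOTE:
                if marker in line:
                    found.add(marker)
    if not seen_request:
        return []
    return [m for m in EXPECTED_AFTER_PROMOTE if m not in found]
```

Fed the log lines from this failover, it would return both markers, since neither line ever appeared.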

>
>> I am unsure whether this promotion failure is a bug or a glitch, but the
>> promote procedure is automated and has been tested a couple of hundred
>> times, so I am certain we initiated the promotion correctly.
>
> Are you using homemade scripts? Maybe you need to test them more
> thoroughly, with different environment parameters.

We use a custom script for the restore_command, but it seems that it was
never invoked.
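One way to settle whether the script runs at all is to have it log every call. A minimal sketch, assuming the archive is a plain directory (our actual script and archive layout are not shown here). PostgreSQL replaces %f in restore_command with the requested file name and %p with the path to copy it to, and treats a nonzero exit status as "file not available". The script name and paths below are placeholders.

```python
import datetime
import os
import shutil
import sys
from typing import List, TextIO


def restore(archive_dir: str, wal_name: str, dest_path: str,
            log: TextIO = sys.stderr) -> int:
    """Copy one archived file (%f) to the path PostgreSQL asked for (%p).

    Returns 0 on success and 1 if the file is not in the archive.
    Every call is logged, so it is visible whether and when recovery
    actually invoked restore_command.
    """
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    print(f"{stamp} restore_command called: {wal_name} -> {dest_path}", file=log)
    src = os.path.join(archive_dir, wal_name)
    if not os.path.isfile(src):
        # Expected fairly often: recovery asks for files that do not exist.
        print(f"{stamp} not found in archive: {wal_name}", file=log)
        return 1
    shutil.copyfile(src, dest_path)
    return 0


def main(argv: List[str]) -> int:
    """Entry point: argv is [script, ARCHIVE_DIR, %f, %p]."""
    if len(argv) != 4:
        print("usage: restore_wrapper.py ARCHIVE_DIR %f %p", file=sys.stderr)
        return 2
    return restore(argv[1], argv[2], argv[3])
```

Saved as a script that ends with `sys.exit(main(sys.argv))`, it could be set in recovery.conf as `restore_command = 'python3 /path/to/restore_wrapper.py /path/to/archive %f %p'`. Whether the stderr lines reach the server log depends on the logging setup (with logging_collector enabled they normally do). "Not found" lines on their own are not a problem, because the server routinely asks for files that are not in the archive.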

Regards,
Mladen Marinović
