pg9.6 when is a promoted cluster ready to accept "rewind" request?

From:	magodo <wztdyl(at)sina(dot)com>
To:	pgsql-general(at)postgresql(dot)org
Subject:	pg9.6 when is a promoted cluster ready to accept "rewind" request?
Date:	2018-11-12 05:11:23
Message-ID:	3663c1bfe329a2f934301604c56851f029c4c881.camel@sina.com
Lists:	pgsql-general

Dear supporters,

I'm writing some scripts to implement manual failover. I have two
clusters (let's say p1 and p2), where one is the primary (e.g. p1) and
the other is the standby (e.g. p2). The manual failover procedure is
straightforward:

1. promote on p2
2. wait until `pg_isready` succeeds on p2
3. rewind on p1
4. prepare a recovery.conf on p1
5. start p1
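
The five steps above can be sketched as a small command plan. This is only an illustration under stated assumptions, not the script used by the poster: the host names, data directories, connection strings, and the helper names `failover_plan` and `recovery_conf` are hypothetical. It also assumes p1 has been shut down cleanly before the rewind, which pg_rewind requires but the list does not mention.

```python
def recovery_conf(primary_conninfo):
    """Return recovery.conf text for a 9.6 standby (step 4).

    Single quotes inside the connection string are doubled, as the
    recovery.conf syntax requires.
    """
    escaped = primary_conninfo.replace("'", "''")
    return (
        "standby_mode = 'on'\n"
        f"primary_conninfo = '{escaped}'\n"
        "recovery_target_timeline = 'latest'\n"
    )


def failover_plan(old_primary, new_primary):
    """Return (host, argv) pairs for steps 1, 2, 3 and 5.

    Each argument is a dict with hypothetical keys "host", "port" and
    "pgdata". Step 4 (writing recovery.conf) is a file write on the old
    primary and is handled by recovery_conf() above.
    """
    source = (
        f"host={new_primary['host']} port={new_primary['port']} "
        "user=postgres dbname=postgres"  # assumed superuser connection
    )
    return [
        # 1. promote the standby
        (new_primary["host"], ["pg_ctl", "promote", "-D", new_primary["pgdata"]]),
        # 2. check that it accepts connections
        (new_primary["host"],
         ["pg_isready", "-h", new_primary["host"], "-p", str(new_primary["port"])]),
        # 3. rewind the old primary against the promoted server
        (old_primary["host"],
         ["pg_rewind", "--target-pgdata=" + old_primary["pgdata"],
          "--source-server=" + source]),
        # 5. start the old primary, now configured as a standby
        (old_primary["host"], ["pg_ctl", "start", "-w", "-D", old_primary["pgdata"]]),
    ]


if __name__ == "__main__":
    p1 = {"host": "p1", "port": 5432, "pgdata": "/var/lib/pgsql/p1"}
    p2 = {"host": "p2", "port": 5432, "pgdata": "/var/lib/pgsql/p2"}
    for host, argv in failover_plan(old_primary=p1, new_primary=p2):
        print(host, " ".join(argv))
    print(recovery_conf("host=p2 port=5432 user=postgres"))
```

Keeping the plan as data makes it easy to log every step, or to insert an extra wait between steps 2 and 3 without touching the rest.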

This should end up with the same HA setup, but with the roles switched.

It works fine if I run each step manually.

But if I run the steps sequentially in a script, it fails once I have
switched roles for the first time and then try to switch back.

For example, with a fresh setup (timeline starts at 1), I first
switched roles, and it worked: p1 became a standby following p2,
which was the primary. Then I switched roles again and an error
occurred. The log looks like this:

< 2018-11-12 04:59:24.547 UTC > LOG: entering standby mode
< 2018-11-12 04:59:24.555 UTC > LOG: redo starts at 0/4000028
< 2018-11-12 04:59:24.566 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.566 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed
< 2018-11-12 04:59:24.577 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.577 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed
< 2018-11-12 04:59:25.413 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:26.416 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:27.419 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:28.422 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.425 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.576 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:29.576 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed

The pg_rewind output is as follows:

servers diverged at WAL position 0/5000060 on timeline 1
rewinding from last common checkpoint at 0/4000060 on timeline 1

From the log, it seems that the wrong timeline is evaluated for the
point of divergence: it should be timeline 2 rather than 1.
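
To make the timeline mismatch concrete, the WAL segment name in the error can be decoded. The sketch below assumes the default 16 MB WAL segment size and the standard 24-hex-digit file name layout (8 digits each for timeline, log id, and segment number); the function names are illustrative only.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_wal_segment(name):
    """Split a 24-hex-digit WAL file name into (timeline, log_id, segment)."""
    if len(name) != 24:
        raise ValueError(f"expected 24 hex digits, got {len(name)}: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log_id, segment


def segment_start_lsn(name):
    """Return the LSN, in the usual X/Y notation, at which the segment starts."""
    _, log_id, segment = parse_wal_segment(name)
    return f"{log_id:X}/{segment * WAL_SEGMENT_SIZE:X}"


if __name__ == "__main__":
    name = "000000020000000000000005"  # taken from the FATAL message above
    print(parse_wal_segment(name))
    print(segment_start_lsn(name))
```

For the segment in the error message, the first eight digits decode to timeline 2 and the segment starts at 0/5000000, the same position at which the standby tries to start streaming, while pg_rewind reported the divergence on timeline 1.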

Furthermore, if I add a `sleep` between step 2 (the wait after promote)
and step 3 (rewind), it just works.

Hence, I suspect the promoted cluster is not ready to be used as a
rewind source right after promotion. Is there anything I need to wait
for before I rewind the old primary against this promoted cluster?
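
One hypothesis that fits the `sleep` observation, and which this message itself does not confirm, is that pg_rewind takes the source's timeline from its control file, and that the control file only reflects the new timeline after the first checkpoint following promotion. Under that assumption, a script could poll the `Latest checkpoint's TimeLineID` line of `pg_controldata` output on p2 (or force a `CHECKPOINT` there) instead of sleeping for a fixed time. The helper names and the injected `read_controldata` callable below are hypothetical.

```python
import re
import time

# Matches the "Latest checkpoint's TimeLineID" line of pg_controldata output.
_TIMELINE_LINE = re.compile(
    r"^Latest checkpoint's TimeLineID:\s*(\d+)\s*$", re.MULTILINE
)


def checkpoint_timeline(controldata_text):
    """Extract the latest checkpoint's timeline ID from pg_controldata output."""
    match = _TIMELINE_LINE.search(controldata_text)
    if match is None:
        raise ValueError("no \"Latest checkpoint's TimeLineID\" line found")
    return int(match.group(1))


def wait_for_timeline(read_controldata, expected, timeout=30.0, interval=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the checkpoint timeline reaches `expected`.

    `read_controldata` is a caller-supplied function returning the current
    pg_controldata output of the promoted server as a string. Returns True
    on success and False if `timeout` seconds pass first.
    """
    deadline = clock() + timeout
    while True:
        if checkpoint_timeline(read_controldata()) >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In the scenario described here, `expected` would be the old timeline plus one (for example, 2 after the first promotion). Since `pg_controldata` reads the data directory directly, `read_controldata` would have to run it on the promoted host.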

Thank you in advance!

---
magodo
