pg9.6 when is a promoted cluster ready to accept "rewind" request?

From: magodo <wztdyl(at)sina(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: pg9.6 when is a promoted cluster ready to accept "rewind" request?
Date: 2018-11-12 05:11:23
Message-ID: 3663c1bfe329a2f934301604c56851f029c4c881.camel@sina.com
Lists: pgsql-general


Dear supporters,

I'm writing some scripts to implement manual failover. I have two
clusters (let's say p1 and p2), where one is the primary (e.g. p1) and
the other is the standby (e.g. p2). The manual failover procedure is
straightforward:

1. promote on p2
2. wait for `pg_isready` to succeed on p2
3. rewind on p1
4. prepare a recovery.conf on p1
5. start p1
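
The five steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the paths, host names, the `postgres` user and the function names are hypothetical, the wait in step 2 is assumed to mean the `pg_isready` utility, and p1 is assumed to have been shut down cleanly before pg_rewind runs. The function only builds the command lines, so they can be inspected before anything is executed.

```python
from typing import List


def recovery_conf(primary_conninfo: str) -> str:
    """Return recovery.conf contents for a 9.6 standby (step 4)."""
    return (
        "standby_mode = 'on'\n"
        f"primary_conninfo = '{primary_conninfo}'\n"
        "recovery_target_timeline = 'latest'\n"
    )


def failover_commands(new_primary_dir: str, new_primary_host: str,
                      new_primary_port: int,
                      old_primary_dir: str) -> List[List[str]]:
    """Build the commands for steps 1, 2, 3 and 5, in that order.

    Step 4 is a file write, not a command: the output of recovery_conf()
    goes into <old_primary_dir>/recovery.conf after the rewind and
    before the start.
    """
    # Hypothetical connection string; adjust user/dbname to the real setup.
    source = (f"host={new_primary_host} port={new_primary_port} "
              "user=postgres dbname=postgres")
    return [
        ["pg_ctl", "promote", "-D", new_primary_dir],                    # step 1
        ["pg_isready", "-h", new_primary_host,
         "-p", str(new_primary_port)],                                   # step 2
        ["pg_rewind", f"--target-pgdata={old_primary_dir}",
         f"--source-server={source}"],                                   # step 3
        ["pg_ctl", "start", "-D", old_primary_dir],                      # step 5
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, so the script stops at the first step that fails.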

This should end up with the same HA setup, but with the roles switched.

It works fine if I run each step manually.

But if I call the steps sequentially in a script, it fails after I have
switched roles for the first time and want to switch back.

For example, with a fresh setup (timeline starts from 1), I first
switched roles, and that worked: p1 became a standby following p2,
which was now the primary. Then I switched roles again and an error
occurred. The log looks like this:

< 2018-11-12 04:59:24.547 UTC > LOG: entering standby mode
< 2018-11-12 04:59:24.555 UTC > LOG: redo starts at 0/4000028
< 2018-11-12 04:59:24.566 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.566 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed

< 2018-11-12 04:59:24.577 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.577 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed

< 2018-11-12 04:59:25.413 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:26.416 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:27.419 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:28.422 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.425 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.576 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:29.576 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed

The pg_rewind output is as follows:

servers diverged at WAL position 0/5000060 on timeline 1
rewinding from last common checkpoint at 0/4000060 on timeline 1

From the log, it seems that pg_rewind evaluated the wrong timeline for
the point of divergence: it should be timeline 2 rather than 1.
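
The timeline mismatch can be read directly from the segment name in the error message. A WAL segment file name is 24 hex digits: 8 for the timeline, 8 for the high 32 bits of the position, and 8 for the segment number. The sketch below decodes it; the function name is made up for illustration, and the default 16 MB segment size is assumed.

```python
# Default WAL segment size; assumed not to have been changed at build time.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_wal_segment_name(name: str):
    """Split a 24-hex-digit WAL segment name into (timeline, start LSN)."""
    if len(name) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)    # first 8 digits: timeline ID
    log = int(name[8:16], 16)        # next 8 digits: high half of the LSN
    seg = int(name[16:24], 16)       # last 8 digits: segment within that half
    start_lsn = f"{log:X}/{seg * WAL_SEGMENT_SIZE:X}"
    return timeline, start_lsn


print(parse_wal_segment_name("000000020000000000000005"))
```

For the segment in the log this gives timeline 2 and a start position of 0/5000000, so the standby asks for a timeline 2 file while both the streaming message and the pg_rewind output still refer to timeline 1.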

Furthermore, if I add a `sleep` between step 2 (the wait after promote)
and step 3 (rewind), it just works.

Hence, I suspect the promoted cluster is not ready to be used as a
rewind source right after promotion. Is there anything I need to wait
for before I rewind the old primary against this promoted cluster?
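
A fixed `sleep` could be replaced by polling for an explicit condition. Which condition is the right one is exactly the open question here, so the sketch below is an assumption rather than a confirmed diagnosis: it supposes that pg_rewind looks at the timeline recorded in the source's control file, and that this only changes once a checkpoint has completed after promotion (the pg_rewind documentation for later releases recommends running `CHECKPOINT` on a recently promoted source for this reason). `wait_until`, `timeline_reached` and `fetch_timeline` are hypothetical names; `fetch_timeline` stands for whatever runs `TIMELINE_QUERY` against the promoted server.

```python
import time
from typing import Callable

# pg_control_checkpoint() is available from 9.6 on; timeline_id is the
# timeline recorded by the last completed checkpoint.
TIMELINE_QUERY = "SELECT timeline_id FROM pg_control_checkpoint()"


def wait_until(condition: Callable[[], bool], timeout: float = 30.0,
               interval: float = 0.5,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll condition() until it is true or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def timeline_reached(fetch_timeline: Callable[[], int],
                     expected: int) -> Callable[[], bool]:
    """Build a condition that is true once the reported timeline
    is at least `expected`."""
    return lambda: fetch_timeline() >= expected
```

In the failover script this would sit between steps 2 and 3, for example `wait_until(timeline_reached(fetch_timeline, old_timeline + 1))`, optionally after issuing `CHECKPOINT` on the promoted server.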

Thank you in advance!

---
magodo
