From: | magodo <wztdyl(at)sina(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | pg9.6 when is a promoted cluster ready to accept "rewind" request? |
Date: | 2018-11-12 05:11:23 |
Message-ID: | 3663c1bfe329a2f934301604c56851f029c4c881.camel@sina.com |
Lists: | pgsql-general |
Dear supporters,
I'm writing some scripts to implement manual failover. I have two
clusters (let's say p1 and p2), where one is the primary (e.g. p1) and
the other is the standby (e.g. p2). The way to do a manual failover is
straightforward, as follows:
1. promote on p2
2. wait for `pg_is_ready()` on p2
3. rewind on p1
4. prepare a recovery.conf on p1
5. start p1
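
The five steps above can be sketched as a small script. This is only a minimal sketch under assumptions: the data directory paths, host name, port and replication user are hypothetical, both data directories are assumed to be reachable from where the script runs, step 2 is assumed to mean polling the `pg_isready` utility, and the recovery.conf contents are the usual 9.6 standby settings rather than the ones actually used here.

```python
import os
import subprocess
import time


def failover_steps(new_primary_pgdata, old_primary_pgdata,
                   new_primary_host, new_primary_port=5432,
                   repl_user="replicator"):
    """Build the ordered failover steps as (kind, payload) tuples.

    All arguments are illustrative placeholders, not values from the post.
    """
    conninfo = "host=%s port=%d user=%s" % (
        new_primary_host, new_primary_port, repl_user)
    recovery_conf = (
        "standby_mode = 'on'\n"
        "primary_conninfo = '%s'\n"
        "recovery_target_timeline = 'latest'\n" % conninfo
    )
    return [
        # 1. promote the standby (p2 in the post)
        ("run", ["pg_ctl", "promote", "-D", new_primary_pgdata]),
        # 2. wait until the promoted server accepts connections
        ("wait", ["pg_isready", "-h", new_primary_host,
                  "-p", str(new_primary_port)]),
        # 3. rewind the old primary (p1) against the promoted server
        ("run", ["pg_rewind",
                 "--target-pgdata=" + old_primary_pgdata,
                 "--source-server=" + conninfo + " dbname=postgres"]),
        # 4. prepare recovery.conf on the old primary
        ("write", (os.path.join(old_primary_pgdata, "recovery.conf"),
                   recovery_conf)),
        # 5. start the old primary as a standby
        ("run", ["pg_ctl", "start", "-w", "-D", old_primary_pgdata]),
    ]


def run_steps(steps, poll_interval=1.0):
    """Execute the steps in order, stopping at the first failing command."""
    for kind, payload in steps:
        if kind == "run":
            subprocess.run(payload, check=True)
        elif kind == "wait":
            # Poll until the command exits with status 0.
            while subprocess.run(payload).returncode != 0:
                time.sleep(poll_interval)
        elif kind == "write":
            path, content = payload
            with open(path, "w") as handle:
                handle.write(content)
```

Keeping the step list separate from its execution makes it easy to print and inspect the exact commands before running them against real clusters.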
This should end up with the same HA setup, but with the roles switched.
It works fine if I perform each step manually.
But if I call the steps sequentially in a script, it fails once I have
switched roles for the first time and try to switch back.
For example, with a fresh setup (timeline starts at 1), I first
switched roles, and it worked: I got p1 as a standby following p2,
which is the primary. Then I switched roles again and an error
occurred. The error message looks like this:
< 2018-11-12 04:59:24.547 UTC > LOG: entering standby mode
< 2018-11-12 04:59:24.555 UTC > LOG: redo starts at 0/4000028
< 2018-11-12 04:59:24.566 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.566 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed
< 2018-11-12 04:59:24.577 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.577 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed
< 2018-11-12 04:59:25.413 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:26.416 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:27.419 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:28.422 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.425 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.576 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:29.576 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed
The pg_rewind output is as follows:
servers diverged at WAL position 0/5000060 on timeline 1
rewinding from last common checkpoint at 0/4000060 on timeline 1
From the log, it seems the wrong timeline of divergence is evaluated:
it should be timeline 2 rather than 1.
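
To make the mismatch in the log concrete: a WAL segment file name is 24 hex digits, where the first 8 are the timeline ID and the remaining 16 identify the segment. The sketch below decodes such a name and builds the name for a given LSN. It assumes the default 16 MB segment size, and the function names are illustrative only.

```python
def parse_wal_segment_name(name):
    """Split a 24-hex-digit WAL segment name into (timeline, log_id, seg)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log_id, seg


def segment_name_for_lsn(timeline, lsn, segment_size=16 * 1024 * 1024):
    """Return the WAL segment file name holding `lsn` (e.g. "0/5000000")."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)
    segment_number = position // segment_size
    # Number of segments covered by one 4 GB "log id" (256 for 16 MB segments).
    segments_per_log_id = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_log_id,
        segment_number % segments_per_log_id,
    )


print(parse_wal_segment_name("000000020000000000000005"))  # (2, 0, 5)
print(segment_name_for_lsn(1, "0/5000000"))  # 000000010000000000000005
print(segment_name_for_lsn(2, "0/5000000"))  # 000000020000000000000005
```

So the standby reports streaming from 0/5000000 on timeline 1, while the segment reported as removed carries timeline ID 2 in its name.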
Furthermore, if I add a `sleep` between the promote (steps 1 and 2)
and the rewind (step 3), it just works.
Hence, I suspect the promoted cluster is not ready to be used for
rewinding right after the promote. Is there anything I need to wait
for before I rewind the old primary against this promoted cluster?
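
One candidate condition to wait for instead of a fixed `sleep`, offered as an assumption rather than a confirmed answer: pg_rewind takes the source's timeline from its control file, and that file may only show the new timeline once the checkpoint following the promote has completed. The sketch below polls `pg_controldata` output for "Latest checkpoint's TimeLineID" until it reaches the expected value. The helper names, the timeout values and the idea of running `pg_controldata` against the promoted cluster's data directory are all hypothetical.

```python
import os
import re
import subprocess
import time


def checkpoint_timeline(controldata_text):
    """Extract "Latest checkpoint's TimeLineID" from pg_controldata output."""
    match = re.search(r"^Latest checkpoint's TimeLineID:\s*(\d+)\s*$",
                      controldata_text, re.MULTILINE)
    if match is None:
        raise ValueError("TimeLineID line not found in pg_controldata output")
    return int(match.group(1))


def read_controldata(pgdata):
    """Run pg_controldata with a C locale so the labels are not translated."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(["pg_controldata", pgdata], check=True,
                            capture_output=True, text=True, env=env)
    return result.stdout


def wait_for_timeline(read_text, expected_timeline,
                      timeout=30.0, interval=0.5):
    """Poll until the checkpoint timeline reaches expected_timeline.

    `read_text` is any callable returning pg_controldata output.
    Returns True on success and False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if checkpoint_timeline(read_text()) >= expected_timeline:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the failover script this would sit between step 2 and step 3, for example `wait_for_timeline(lambda: read_controldata("/path/to/p2/data"), 2)` for the first switch back, with the path being a placeholder. Issuing an explicit `CHECKPOINT` on the promoted server before rewinding may serve the same purpose.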
Thank you in advance!
---
magodo