Quick Links

Re: Recovery from WAL archives not advancing timeline?

From:	Don Seiler <don(at)seiler(dot)us>
To:	Ian Lawrence Barwick <barwick(at)gmail(dot)com>
Cc:	Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject:	Re: Recovery from WAL archives not advancing timeline?
Date:	2020-08-10 20:28:34
Message-ID:	CAHJZqBB5X37hGJJyUR2QG-NHqjUnQVTVcWxRk0Yjwsi=tpxv+Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-admin

On Sat, Aug 8, 2020 at 2:24 PM Don Seiler <don(at)seiler(dot)us> wrote:

>
> Doing some more testing and it seems to boil down to the fact that a
> replica should create a .partial file when it is promoted. I tested both PG
> 9.6 and 10, with replicas streaming vs restoring WAL files. The sequence
> was always this:
>
>
> 1. Stop primary (A)
> 2. Observe primary writes on final WAL file and it is archived
> 3. Promote replica (B)
> 4. Add recovery.conf to A for just WAL recovery and start.
>
> In cases where the replica B was a streaming replica, upon promotion it
> would create a .partial file in the old timeline and then a history file in
> the new timeline and proceed. These files were then archived by wal-g and
> the new replica (A) would recover them and continue replication.
>
> If replica B was not streaming to begin with (so it was just restoring and
> recovering WAL from archives), then upon promotion it would NOT create a
> .partial file. It would just create the .history file and open up. When A
> is then started in step 4, it would complain like this (in PG 9.6):
>
> 2020-08-08 19:13:43.337 UTC [17837] LOG: database system was shut down at
> 2020-08-08 19:09:24 UTC
>
> 2020-08-08 19:13:44.580 UTC [17837] LOG: restored log file
> "00000002.history" from archive
>
> ERROR: 2020/08/08 19:13:45.213175 Archive '00000003.history' does not
> exist.
> 2020-08-08 19:13:45.214 UTC [17837] LOG: entering standby mode
>
> 2020-08-08 19:13:45.676 UTC [17837] LOG: restored log file
> "00000002.history" from archive
> 2020-08-08 19:13:46.327 UTC [17837] LOG: restored log file
> "000000020000000000000006" from archive
>
> 2020-08-08 19:13:46.381 UTC [17837] LOG: invalid xl_info in primary
> checkpoint record
> 2020-08-08 19:13:46.381 UTC [17837] PANIC: could not locate a valid
> checkpoint record
>
> 2020-08-08 19:13:48.095 UTC [17836] LOG: startup process (PID 17837) was
> terminated by signal 6: Aborted
> 2020-08-08 19:13:48.095 UTC [17836] LOG: aborting startup due to startup
> process failure
> 2020-08-08 19:13:48.105 UTC [17836] LOG: database system is shut down
>
> I even tested forcing an additional log switch from the new primary (which
> is the example you see above) but it still doesn't jive.
>
> I'm not sure if this is a bug or expected behavior with non-streaming
> replicas but I believe it means that all of my old primaries need to be
> restored from new backups as the new primaries were all
> restore/recovery setup (due to VPN bandwidth limitations that made
> streaming difficult). Unless someone has a way forward?
>

I'm following up to see if anyone else would categorize this as a bug. It
renders my old primaries dead until I do a restore on them.

Don.

--
Don Seiler
www.seiler.us

In response to

Re: Recovery from WAL archives not advancing timeline? at 2020-08-08 19:24:39 from Don Seiler

Responses

Re: Recovery from WAL archives not advancing timeline? at 2020-08-11 00:03:54 from Rui DeSousa

Browse pgsql-admin by date

	From	Date	Subject
Next Message	Rui DeSousa	2020-08-11 00:03:54	Re: Recovery from WAL archives not advancing timeline?
Previous Message	Mohammed Afsar	2020-08-10 11:51:44	Postgresql background writer and checkpointer memory utilization in Linux server