Re: Recovery from WAL archives not advancing timeline?

From: Don Seiler <don(at)seiler(dot)us>
To: Ian Lawrence Barwick <barwick(at)gmail(dot)com>
Cc: Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: Recovery from WAL archives not advancing timeline?
Date: 2020-08-08 19:24:39
Message-ID: CAHJZqBAH0cg9qMzUteyvJqJBbAFyY5y84P-kJTgcYQkXi+bcAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

On Sat, Aug 8, 2020 at 11:10 AM Don Seiler <don(at)seiler(dot)us> wrote:

> On Sat, Aug 8, 2020 at 10:59 AM Don Seiler <don(at)seiler(dot)us> wrote:
>
>>
>> Do you think it would be worth trying to delete (move/rename)
>> the 000000010000142E0000005B file and see if it changes things?
>>
>
> In my PG10 test cluster, I see a ".partial" file getting archived as well.
> In this case I would expect to see a 000000010000142E0000005B.partial.lz4
> file in the archive but I don't. Although my test cluster used streaming
> replication where as this was done without streaming (just WAL
> archive/restore/recovery), and this one is PG 9.6 (I have both 9.6 and 10
> here). I'll re-run my tests without streaming and both 9.6 and 10.
>

Doing some more testing and it seems to boil down to the fact that a
replica should create a .partial file when it is promoted. I tested both PG
9.6 and 10, with replicas streaming vs restoring WAL files. The sequence
was always this:

1. Stop primary (A)
2. Observe primary writes on final WAL file and it is archived
3. Promote replica (B)
4. Add recovery.conf to A for just WAL recovery and start.

In cases where the replica B was a streaming replica, upon promotion it
would create a .partial file in the old timeline and then a history file in
the new timeline and proceed. These files were then archived by wal-g and
the new replica (A) would recover them and continue replication.

If replica B was not streaming to begin with (so it was just restoring and
recovering WAL from archives), then upon promotion it would NOT create a
.partial file. It would just create the .history file and open up. When A
is then started in step 4, it would complain like this (in PG 9.6):

2020-08-08 19:13:43.337 UTC [17837] LOG: database system was shut down at
2020-08-08 19:09:24 UTC

2020-08-08 19:13:44.580 UTC [17837] LOG: restored log file
"00000002.history" from archive

ERROR: 2020/08/08 19:13:45.213175 Archive '00000003.history' does not exist.
2020-08-08 19:13:45.214 UTC [17837] LOG: entering standby mode

2020-08-08 19:13:45.676 UTC [17837] LOG: restored log file
"00000002.history" from archive
2020-08-08 19:13:46.327 UTC [17837] LOG: restored log file
"000000020000000000000006" from archive

2020-08-08 19:13:46.381 UTC [17837] LOG: invalid xl_info in primary
checkpoint record
2020-08-08 19:13:46.381 UTC [17837] PANIC: could not locate a valid
checkpoint record

2020-08-08 19:13:48.095 UTC [17836] LOG: startup process (PID 17837) was
terminated by signal 6: Aborted
2020-08-08 19:13:48.095 UTC [17836] LOG: aborting startup due to startup
process failure
2020-08-08 19:13:48.105 UTC [17836] LOG: database system is shut down

I even tested forcing an additional log switch from the new primary (which
is the example you see above) but it still doesn't jive.

I'm not sure if this is a bug or expected behavior with non-streaming
replicas but I believe it means that all of my old primaries need to be
restored from new backups as the new primaries were all
restore/recovery setup (due to VPN bandwidth limitations that made
streaming difficult). Unless someone has a way forward?

Don.
--
Don Seiler
www.seiler.us

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Mohammed Afsar 2020-08-10 10:15:15 Postgresql background writer and checkpointer memory utilization in Linux server
Previous Message Don Seiler 2020-08-08 16:10:24 Re: Recovery from WAL archives not advancing timeline?