Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Date: 2022-04-09 00:59:10
Message-ID: 20220409005910.alw46xqmmgny2sgr@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> I tried adjusting the patch so it does guarantee that (as attached),
> and in two out of two tries it got past the archive_cleanup_command
> failure but then hung up waiting for standby2 to promote.

Adding

$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
$node_standby2->start;

makes the tests pass here.

What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery. To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive. As this requires at least two timeline
# switches, promote the existing standby first. Then create a second
# standby based on the promoted one. Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

$node_standby2->init_from_backup($node_primary, $backup_name,
has_restoring => 1);

It's created from the original primary, not the first standby, as the
description says...

Both nodes get promoted independently, in a run without valgrind:

standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG: INSERT @ 0/4000058: - XLOG/END_OF_RECOVERY: tli 2; prev tli 1; time 2022-04-08 17:23:42.96686-07
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG: xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG: INSERT @ 0/4000058: - XLOG/END_OF_RECOVERY: tli 3; prev tli 1; time 2022-04-08 17:23:43.307443->
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG: xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

except that standby2 can't choose tli 2 because it finds it used.

Sure looks like something is funky with that test.

But I think there's also something funky in the prefetching logic. I think it
may attempt restoring during prefetching somehow, even though there's code
that appears to try to prevent that?

on standby2 I can see replay progress like the following:
2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/3024488; LSN 0/30244C8: prev 0/3024448; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 60 flags 0x00
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/3024488 for Heap/INSERT: off 60 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG: could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG: could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/30244C8; LSN 0/3024508: prev 0/3024488; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 61 flags 0x00
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/30244C8 for Heap/INSERT: off 61 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG: could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG: could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/3024508; LSN 0/3024548: prev 0/30244C8; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 62 flags 0x00
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/3024508 for Heap/INSERT: off 62 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""

note that we appear to wait between replaying of records, even though we
apparently have WAL for the next record!

And interestingly I'm not seeing the
"switched WAL source from stream to archive after failure"
lines I'd expect.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Peter Eisentraut 2022-04-09 13:51:54 pgsql: Add missing source files to nls.mk
Previous Message Andres Freund 2022-04-08 23:56:08 Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2022-04-09 01:00:36 Re: avoid multiple hard links to same WAL file after a crash
Previous Message Justin Pryzby 2022-04-09 00:27:58 Re: Mingw task for Cirrus CI