From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman |
Date: | 2022-04-09 00:59:10 |
Message-ID: | 20220409005910.alw46xqmmgny2sgr@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-committers pgsql-hackers |
Hi,
On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> I tried adjusting the patch so it does guarantee that (as attached),
> and in two out of two tries it got past the archive_cleanup_command
> failure but then hung up waiting for standby2 to promote.
Adding
$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
$node_standby2->start;
makes the tests pass here.
What is that second test really testing?
# Check the presence of temporary files specifically generated during
# archive recovery. To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive. As this requires at least two timeline
# switches, promote the existing standby first. Then create a second
# standby based on the promoted one. Finally, the second standby is
# promoted.
Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:
$node_standby2->init_from_backup($node_primary, $backup_name,
has_restoring => 1);
It's created from the original primary, not the first standby, as the
description says...
Both nodes get promoted independently, in a run without valgrind:
standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG: INSERT @ 0/4000058: - XLOG/END_OF_RECOVERY: tli 2; prev tli 1; time 2022-04-08 17:23:42.96686-07
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG: xlog flush request 0/4000058; write 0/4000000; flush 0/4000000
standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG: INSERT @ 0/4000058: - XLOG/END_OF_RECOVERY: tli 3; prev tli 1; time 2022-04-08 17:23:43.307443->
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG: xlog flush request 0/4000058; write 0/4000000; flush 0/4000000
except that standby2 can't choose tli 2 because it finds it used.
Sure looks like something is funky with that test.
But I think there's also something funky in the prefetching logic. I think it
may attempt restoring during prefetching somehow, even though there's code
that appears to try to prevent that?
on standby2 I can see replay progress like the following:
2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/3024488; LSN 0/30244C8: prev 0/3024448; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 60 flags 0x00
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/3024488 for Heap/INSERT: off 60 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG: could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG: could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/30244C8; LSN 0/3024508: prev 0/3024488; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 61 flags 0x00
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/30244C8 for Heap/INSERT: off 61 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG: could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG: could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG: REDO @ 0/3024508; LSN 0/3024548: prev 0/30244C8; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 62 flags 0x00
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG: record known xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT: WAL redo at 0/3024508 for Heap/INSERT: off 62 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG: executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
note that we appear to wait between replaying of records, even though we
apparently have WAL for the next record!
And interestingly I'm not seeing the
"switched WAL source from stream to archive after failure"
lines I'd expect.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2022-04-09 13:51:54 | pgsql: Add missing source files to nls.mk |
Previous Message | Andres Freund | 2022-04-08 23:56:08 | Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman |
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2022-04-09 01:00:36 | Re: avoid multiple hard links to same WAL file after a crash |
Previous Message | Justin Pryzby | 2022-04-09 00:27:58 | Re: Mingw task for Cirrus CI |