Re: Race condition in recovery?

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-06-09 06:44:50
Message-ID: CAFiTN-tPh8eR1zHc7WCMbBMKn4bOfwvKK0fqKKhY6phVV4ENpg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Then I tried to get things working on 9.6. There's a patch attached to
> back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
> a version of the main patch attached with the necessary wal->xlog,
> lsn->location renaming. Unfortunately ... the new test case still
> fails on 9.6 in a way that looks an awful lot like the bug isn't
> actually fixed:
>
> LOG: primary server contains no more WAL on requested timeline 1
> cp:
> /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003:
> No such file or directory
> (repeated many times)
>
> I find that the same failure happens if I back-port the master version
> of the patch to v10 or v11,

I think this fails because, prior to v12, recovery_target_timeline did not
default to 'latest'; at that time it was a recovery.conf setting, not a GUC.
With the change below, the test started passing on v11 (only tested on v11
so far).

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 842878a..b3ce5da 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -50,6 +50,9 @@ my $node_cascade = get_new_node('cascade');
 $node_cascade->init_from_backup($node_standby, $backup_name,
 	has_streaming => 1);
 $node_cascade->enable_restoring($node_primary);
+$node_cascade->append_conf('recovery.conf', qq(
+recovery_target_timeline='latest'
+));
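
For anyone back-porting the test across branches, the version dependence described above can be sketched as a small helper. This is only an illustration under stated assumptions: latest_timeline_conf is a hypothetical name, not part of PostgresNode.pm, and the major version is passed in by hand.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PostgresNode.pm): returns the config file
# name and the text to append so that a standby follows the latest timeline.
# Returns an empty list when no extra setting is needed.
sub latest_timeline_conf
{
	my ($major_version) = @_;

	# From v12 on, recovery_target_timeline is a GUC that defaults to
	# 'latest', so nothing has to be appended.
	return () if $major_version >= 12;

	# Before v12 the setting lives in recovery.conf and must be set
	# explicitly.
	return ('recovery.conf', "recovery_target_timeline='latest'\n");
}

my @conf = latest_timeline_conf(11);
print "append to $conf[0]: $conf[1]" if @conf;
```

In the test itself, the returned pair could be handed to the node in the same way as in the diff, for example $node_cascade->append_conf(@conf) if @conf;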

But now the test passes even without the fix, and the log shows that it
never tried to stream from the primary on TL 1, so it never reached the
defect location.

2021-06-09 12:11:08.618 IST [122456] LOG: entering standby mode
2021-06-09 12:11:08.622 IST [122456] LOG: restored log file "00000002.history" from archive
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
2021-06-09 12:11:08.627 IST [122456] LOG: redo starts at 0/2000028
2021-06-09 12:11:08.627 IST [122456] LOG: consistent recovery state reached at 0/3000000

Next, I will investigate why, without the fix, the test on v11 (and maybe
v12, v10, ...) does not hit the defect location at all. After that, I will
check the status on the other older versions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
