Re: Race condition in recovery?

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: robertmhaas(at)gmail(dot)com, dilipbalaut(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Race condition in recovery?
Date: 2021-06-11 05:26:44
Message-ID: 20210611.142644.1872001951622668861.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Fri, 11 Jun 2021 14:07:45 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> > conchuela's failure is evidently not every time, but this test
> > definitely postdates the "fix":

conchuela failed recovery_check this time, and

> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08
> So the standby2 was stuck after selecting the new timeline and before
> updating control file and its postmaster couldn't even respond to
> SIGQUIT.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25

This one is from before the "fix".

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-08%2014%3A07%3A46

This one failed in pg_verifybackupCheck:

> ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/regress_log_003_corruption ==~_~===-=-===~_~==
...
> # Failed test 'base backup ok'
> # at t/003_corruption.pl line 115.
> # Running: pg_verifybackup /home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails
> pg_verifybackup: fatal: could not open file "/home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails/backup_manifest": No such file or directory
> not ok 38 - intact backup verified

The manifest file is missing from the backup. In this case, too, the
servers failed to handle SIGQUIT.

> ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/003_corruption_primary.log ==~_~===-=-===~_~==
...
> 2021-06-08 16:17:41.706 CEST [51792:9] 003_corruption.pl LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000 TIMELINE 1
> 2021-06-08 16:17:41.706 CEST [51792:10] 003_corruption.pl STATEMENT: START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000 TIMELINE 1
(log ends here)

Could this be some kind of hardware failure?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
