Re: Race condition in recovery?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race condition in recovery?
Date: 2021-06-08 20:37:07
Message-ID: CA+TgmoZTWe2jyGvCCziNuEXzbaxZ6+E64GbejELYhvrPV8=k+Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 8, 2021 at 12:26 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think the problem is here:
>
> Can't locate object method "lsn" via package "PostgresNode" at
> t/025_stuck_on_old_timeline.pl line 84.
>
> When that happens, it bails out, and cleans everything up, doing an
> immediate shutdown of all the nodes. The 'lsn' method was added by
> commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in
> v10 and later. I think maybe we can think of back-porting that to 9.6.

Here's an updated set of patches. I removed the extra teardown_node
calls per Kyotaro Horiguchi's request. I adopted his suggestion for
setting a $perlbin variable from $^X, but found that $perlbin was
undefined, so I split the incantation into two lines to fix that. I
updated the code to use ->promote() instead of calling pg_promote(),
and to use poll_query_until() afterwards to wait for promotion as
suggested by Dilip. Also, I added a comment to the change in xlog.c.

Then I tried to get things working on 9.6. There's a patch attached to
back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
a version of the main patch attached with the necessary wal->xlog,
lsn->location renaming. Unfortunately ... the new test case still
fails on 9.6 in a way that looks an awful lot like the bug isn't
actually fixed:

LOG: primary server contains no more WAL on requested timeline 1
cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003: No such file or directory
(repeated many times)
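
To read the file name in that error: a WAL segment file name is 24 hex digits, made of three 8-digit fields (timeline ID, then the high and low parts of the segment number). The sketch below decodes it; the helper names are hypothetical, and the start-LSN calculation assumes the default 16 MB segment size.

```python
from typing import NamedTuple

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


class WalSegment(NamedTuple):
    timeline: int  # timeline ID (first 8 hex digits)
    log: int       # high part of the segment number (middle 8 digits)
    seg: int       # low part of the segment number (last 8 digits)


def parse_wal_segment_name(name: str) -> WalSegment:
    """Split a 24-hex-digit WAL segment file name into its three fields."""
    if len(name) != 24:
        raise ValueError(f"expected 24 hex digits, got {name!r}")
    # int(..., 16) raises ValueError if a field is not valid hex.
    return WalSegment(int(name[0:8], 16),
                      int(name[8:16], 16),
                      int(name[16:24], 16))


def segment_start_lsn(s: WalSegment, segsize: int = WAL_SEGMENT_SIZE) -> str:
    """Return the LSN of the first byte of the segment, in X/X notation."""
    segs_per_log = 0x100000000 // segsize
    start = (s.log * segs_per_log + s.seg) * segsize
    return f"{start >> 32:X}/{start & 0xFFFFFFFF:X}"


if __name__ == "__main__":
    seg = parse_wal_segment_name("000000010000000000000003")
    print(seg.timeline, segment_start_lsn(seg))
```

For the file in the log above this gives timeline 1 and a start LSN of 0/3000000, which lines up with the "requested timeline 1" message: the standby keeps asking for a timeline-1 segment.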

I find that the same failure happens if I back-port the master version
of the patch to v10 or v11, but if I back-port it to v12 or v13 then
the test passes as expected. I haven't figured out what the issue is
yet. I also noticed that if I back-port it to v12 and then revert the
code change, the test still passes. So I think there may still be
something subtly wrong with this test case. Or maybe there's a code bug.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment                                                            Content-Type              Size
v7-0001-Fix-corner-case-failure-of-new-standby-to-follow-.patch       application/octet-stream  7.2 KB
9.6-v7-0002-Fix-corner-case-failure-of-new-standby-to-follow-.patch   application/octet-stream  7.2 KB
9.6-v7-0001-Back-port-a-few-PostgresNode.pm-methods.patch             application/octet-stream  3.6 KB
