From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Race condition in recovery? |
Date: | 2021-06-08 20:37:07 |
Message-ID: | CA+TgmoZTWe2jyGvCCziNuEXzbaxZ6+E64GbejELYhvrPV8=k+Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jun 8, 2021 at 12:26 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think the problem is here:
>
> Can't locate object method "lsn" via package "PostgresNode" at
> t/025_stuck_on_old_timeline.pl line 84.
>
> When that happens, it bails out, and cleans everything up, doing an
> immediate shutdown of all the nodes. The 'lsn' method was added by
> commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in
> v10 and later. I think maybe we can think of back-porting that to 9.6.
Here's an updated set of patches. I removed the extra teardown_node
calls per Kyotaro Horiguchi's request. I adopted his suggestion for
setting a $perlbin variable from $^X, but found that $perlbin was
undefined, so I split the incantation into two lines to fix that. I
updated the code to use ->promote() instead of calling pg_promote(),
and to use poll_query_until() afterwards to wait for promotion as
suggested by Dilip. Also, I added a comment to the change in xlog.c.
Then I tried to get things working on 9.6. There's a patch attached to
back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
a version of the main patch attached with the necessary wal->xlog,
lsn->location renaming. Unfortunately ... the new test case still
fails on 9.6 in a way that looks an awful lot like the bug isn't
actually fixed:
LOG: primary server contains no more WAL on requested timeline 1
cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003: No such file or directory
(repeated many times)
I find that the same failure happens if I back-port the master version
of the patch to v10 or v11, but if I back-port it to v12 or v13 then
the test passes as expected. I haven't figured out what the issue is
yet. I also noticed that if I back-port it to v12 and then revert the
code change, the test still passes. So I think there may still be something
subtly wrong with this test case. Or maybe there's a code bug.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
v7-0001-Fix-corner-case-failure-of-new-standby-to-follow-.patch | application/octet-stream | 7.2 KB |
9.6-v7-0002-Fix-corner-case-failure-of-new-standby-to-follow-.patch | application/octet-stream | 7.2 KB |
9.6-v7-0001-Back-port-a-few-PostgresNode.pm-methods.patch | application/octet-stream | 3.6 KB |