Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Georgy Shelkovy <g(dot)shelkovy(at)arenadata(dot)io>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
Date: 2024-08-09 15:26:29
Message-ID: fe7af87d-b9bf-481a-902c-089c49aa911d@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 08/08/2024 16:07, Heikki Linnakangas wrote:
> So it seems true that rewind is not required in those cases. However, if
> the WAL is already written on the standby's disk, just not replayed yet,
> then when you restart the server, it will replay the WAL from timeline
> 1. That does seem surprising. Perhaps pg_rewind should just update the
> minRecoveryPoint and minRecoveryTLI in control file in that case, to
> force WAL recovery to follow the timeline switch to TLI 2.
>
> I will try to write a TAP test for the "Bad" and the assertion failure
> case, fix the assertion failure, and test how updating the
> minRecoveryPoint would behave.

I've divided this into three separate issues:

1. I think the assertion in pg_rewind is simply bogus, and we should
remove it. Attached 0002-Remove-bogus-assertion-in-pg_rewind.patch does
that, and adds a test case to cover it.

2. Independently of pg_rewind: When you start PostgreSQL, it will first
try to recover all the WAL it has locally in pg_wal. That goes wrong if
you have set a recovery target TLI. For example, imagine this situation:

- Recovery target TLI is 2, set explicitly in postgresql.conf
- The switchpoint from TLI 1 to 2 happened at WAL position 0/1510198
(the switchpoint is found in 00000002.history)
- There is a WAL file 000000010000000000000001 under pg_wal, which
contains valid WAL up to 0/1590000

When you start the server, it will first recover all the WAL from
000000010000000000000001, up to 0/1590000. Then it will connect to the
primary to fetch mor WAL, but it will fail to make any progress because
it already blew past the switch point.

It's obviously wrong to replay the WAL from timeline 1 beyond the 1->2
switchpoint, when the recovery target is TLI 2. The attached
0003-Don-t-read-past-current-TLI-during-archive-recovery.patch fixes
that. However, the logic to find the right WAL segment file and read the
WAL is extremely complicated, and I don't feel comfortable that I got
all the cases right. Review would be highly appreciated.

The patch includes a test case to demonstrate the case, with no
pg_rewind. It does include one "manual" step to copy a timeline history
file into pg_wal, marked with XXX, however. So I'm not sure how possible
this scenario is in production setups .

3. When pg_rewind has nothing to do, the target server is left
unmodified, in a state such that when you restart it, it will replay all
the WAL it has locally in pg_wal first, before connecting to the
primary. Even though the target is a direct ancestor of the source and
hence it *can* follow the WAL to the source's position without
rewinding, it doesn't mean that it *will* actually do so.

The attached changes it so that it updates the control file in that
case, setting minRecoveryPoint and minRecoveryPointTLI to point to the
source's current WAL position. That way, when you start it up, it will
follow the timeline history to reach that point. (This requires fixing
issue 2, because otherwise it still won't follow the history correctly
to reach the minRecoveryPointTLI)

To make the test pass, it actually would be sufficient to copy the
timeline history file into pg_wal (which the patch also does). But
updating the minRecoveryPoint seems good to ensure that it follows the
right timeline. Otherwise it relies on the logic at startup to find the
latest timeline, and that the latest timeline is the one you tried to
rewind to. I think it would go wrong if there was another
higher-numbered history file present in pg_wal for some reason.

All in all, I don't feel very confident about all this. The assertion
seems straightforward, so barring objections I'll commit and backpatch
that. The timeline-following at startup (issue 2) seems pretty clearly
wrong, but I'm not sure it's worth the risk to backpatch. Similarly
issue 3 might not be worth the risk to backpatch, especially if we don't
also backpatch 2. I would love to hear comments on those.

Georgy, if you have the possibility to test this patches with your repro
script, that would be highly appreciated.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
0001-If-a-command-dies-in-a-command_checks_all-test-print.patch text/x-patch 1.1 KB
0002-Remove-bogus-assertion-in-pg_rewind.patch text/x-patch 5.9 KB
0003-Don-t-read-past-current-TLI-during-archive-recovery.patch text/x-patch 11.1 KB
0004-Update-minRecoveryPoint-when-no-rewind-is-required.patch text/x-patch 5.4 KB

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Aleš Zelený 2024-08-09 16:18:44 Re: BUG #18573: Analyze command consumes several GB of memory - more than analyzed table size
Previous Message Tom Lane 2024-08-09 15:25:27 Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for