Re: Corner-case bug in pg_rewind

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Ian Barwick <ian(dot)barwick(at)2ndquadrant(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Corner-case bug in pg_rewind
Date: 2020-09-29 12:00:13
Message-ID: d5824e45-4ad2-19c7-4c29-ceeb720efded@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/09/2020 09:42, Ian Barwick wrote:
> Take the following cluster with:
> - node1 (initial primary)
> - node2 (standby)
> - node3 (standby)
>
> Following activity takes place (greatly simplified from a real-world situation):
>
> 1. node1 is shut down.
> 2. node3 is promoted
> 3. node2 is attached to node3.
> 4. node1 is attached to node3
> 5. node1 is then promoted (creating a split brain situation with
> node1 and node3 as primaries)
> 6. node2 and node3 are shut down (in that order).
> 7. pg_rewind is executed to reset node2 so it can reattach
> to node1 as a standby. pg_rewind claims:
>
> pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
> pg_rewind: no rewind required
>
> 8. based off that assurance, node2 is restarted with replication configuration
> pointing to node1 - but it is unable to attach, with node2's log reporting
> something like:
>
> new timeline 3 forked off current database system timeline 2
> before current recovery point X/XXXXXXX
>
> The cause is that pg_rewind is assuming that if the node's last
> checkpoint matches the
> divergence point, no rewind is needed:
>
> if (chkptendrec == divergerec)
> rewind_needed = false;
>
> but in this case there *are* records beyond the last checkpoint, which can be
> inferred from "minRecoveryPoint" - but this is not checked.

Yep, I think you're right.

> Attached patch addresses this. It includes a test, which doesn't make use of
> the RewindTest module, as that hard-codes a primary and a standby, while here
> three nodes are needed (I can't come up with a situation where this can be
> reproduced with only two nodes). The test sets "wal_keep_size" so would need
> modification for Pg12 and earlier.

I think we also need to change the extractPageMap() call:

> /*
> * Read the target WAL from last checkpoint before the point of fork, to
> * extract all the pages that were modified on the target cluster after
> * the fork. We can stop reading after reaching the final shutdown record.
> * XXX: If we supported rewinding a server that was not shut down cleanly,
> * we would need to replay until the end of WAL here.
> */
> if (showprogress)
> pg_log_info("reading WAL in target");
> extractPageMap(datadir_target, chkptrec, lastcommontliIndex,
> ControlFile_target.checkPoint, restore_command);
> filemap_finalize();

so that it scans all the way up to minRecoveryPoint, instead of stopping
at ControlFile_target.checkPoint. Otherwise, we will fail to rewind
changes that happened after the last checkpoint. And we need to do that
regardless of the "no rewind required" bug, any time we rewind a server
that's in DB_SHUTDOWNED_IN_RECOVERY state. I'm surprised none of the
existing tests have caught that. Am I missing something?

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2020-09-29 12:40:42 Re: Disable WAL logging to speed up data loading
Previous Message Bharath Rupireddy 2020-09-29 11:39:06 Re: Skip ExecCheckRTPerms in CTAS with no data