Re: Add checks in pg_rewind to abort if backup_label file is present

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Krishnakumar R <kksrcv001(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add checks in pg_rewind to abort if backup_label file is present
Date: 2023-12-05 20:14:23
Message-ID: 1f299164-f70b-499d-9863-f0e9406ace4a@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 05/12/2023 19:36, Krishnakumar R wrote:
> Hi,
>
> Please find the attached patch for $subject and associated test. Please review.

Thanks for picking up this long-standing TODO!

> +/*
> + * Check if a file is present using the connection to the
> + * database.
> + */
> +static bool
> +libpq_is_file_present(rewind_source *source, const char *path)
> +{
> + PGconn *conn = ((libpq_source *) source)->conn;
> + PGresult *res;
> + const char *paramValues[1];
> +
> + paramValues[0] = path;
> + res = PQexecParams(conn, "SELECT pg_stat_file($1)",
> + 1, NULL, paramValues, NULL, NULL, 1);
> + if (PQresultStatus(res) != PGRES_TUPLES_OK)
> + return false;
> +
> + return true;
> +}

The backup_label file cannot be present when the server is running. No
need to check for that when connected to a live server.

> --- a/src/bin/pg_rewind/pg_rewind.c
> +++ b/src/bin/pg_rewind/pg_rewind.c
> @@ -729,7 +729,11 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
> static void
> sanityChecks(void)
> {
> - /* TODO Check that there's no backup_label in either cluster */
> + if (source->is_file_present(source, "backup_label"))
> + pg_fatal("The backup_label file is present in source cluster");
> +
> + if (is_file_present(datadir_target, "backup_label"))
> + pg_fatal("The backup_label file is present in target cluster");
>
> /* Check system_identifier match */
> if (ControlFile_target.system_identifier != ControlFile_source.system_identifier)

The error message isn't very user friendly. It's pretty dangerous
actually: I think a lot of users would just delete the backup_label file
when they see that message. Because then the file is no longer present
and problem solved, right?

The point of checking for backup_label is that if it's present, the
cluster wasn't really shut down cleanly. The correct fix is to start it,
let WAL recovery finish, and shut it down again. The error message
should make that clear. Perhaps make it similar to the existing "target
server must be shut down cleanly" message.

I think today if you try to run pg_rewind on a cluster that was restored
from a backup, so that backup_label is present, you get the "target
server must be shut down cleanly" message. But we don't have any tests
for it. We do have a test for when the server is still running, but not
for the restored-from-backup case. Would be nice to add one.

Perhaps move the backup_label check later in sanityChecks(), after
checking the state in the control file. That way, you still normally hit
the "target server must be shut down cleanly" case, and the backup_label
check would be just an additional "can't happen" sanity check.

In createBackupLabel() we have this:

> /* TODO: move old file out of the way, if any. */
> open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
> write_target_range(buf, 0, len);
> close_target_file();

That TODO comment needs to go away now. And we probably should complain
if the file already exists. With your patch, we already checked earlier
that it doesn't exists, so if it exists when we reach that code,
something's gone wrong.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-12-05 20:48:33 Re: UBSan pointer overflow in xlogreader.c
Previous Message Tom Lane 2023-12-05 20:08:22 Re: backtrace_on_internal_error