Re: pg_verifybackup: TAR format backup verification

From: Amul Sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Sravan Kumar <sravanvcybage(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_verifybackup: TAR format backup verification
Date: 2024-09-12 11:04:54
Message-ID: CAAJ_b955y7NrkTUGtiyoGkPBs-OgYMv32iW0WTSpVfpFpezxdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 10, 2024 at 1:31 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I would rather that you didn't add simple_ptr_list_destroy_deep()
> given that you don't need it for this patch series.
>
Done.

> +
> "\"%s\" unexpected file in the tar format backup",
>
> This doesn't seem grammatical to me. Perhaps change this to: file
> \"%s\" is not expected in a tar format backup
>
Ok, updated in the attached version.

> + /* We are only interested in files that are not in the ignore list. */
> + if (member->is_directory || member->is_link ||
> + should_ignore_relpath(mystreamer->context, member->pathname))
> + return;
>
> Doesn't this need to happen after we add pg_tblspc/$OID to the path,
> rather than before? I bet this doesn't work correctly for files in
> user-defined tablespaces, compared to the way it work for a
> directory-format backup.
>
> + char temp[MAXPGPATH];
> +
> + /* Copy original name at temporary space */
> + memcpy(temp, member->pathname, MAXPGPATH);
> +
> + snprintf(member->pathname, MAXPGPATH, "%s/%d/%s",
> + "pg_tblspc", mystreamer->tblspc_oid, temp);
>
> I don't like this at all. This function doesn't have any business
> modifying the astreamer_member, and it doesn't need to. It can just do
> char *pathname; char tmppathbuf[MAXPGPATH] and then set pathname to
> either member->pathname or tmppathbuf depending on
> OidIsValid(tblspcoid). Also, shouldn't this be using %u, not %d?
>
True, fixed in the attached version.

> + mystreamer->mfile = (void *) m;
>
> Either the cast to void * isn't necessary, or it indicates that
> there's a type mismatch that should be fixed.
>
Fixed -- was added in the very first version and forgotten in later updates.

> + * We could have these checks while receiving contents. However, since
> + * contents are received in multiple iterations, this would result in
> + * these lengthy checks being performed multiple times. Instead, setting
> + * flags here and using them before proceeding with verification will be
> + * more efficient.
>
> Seems unnecessary to explain this.
>
Removed.

> + Assert(mystreamer->verify_checksum);
> +
> + /* Should have came for the right file */
> + Assert(strcmp(member->pathname, m->pathname) == 0);
> +
> + /*
> + * The checksum context should match the type noted in the backup
> + * manifest.
> + */
> + Assert(checksum_ctx->type == m->checksum_type);
>
> What do you think about:
>
> Assert(m != NULL && !m->bad);
> Assert(checksum_ctx->type == m->checksum_type);
> Assert(strcmp(member->pathname, m->pathname) == 0);
>
> Or possibly change the first one to Assert(should_verify_checksum(m))?
>
LGTM.

> + memcpy(&control_file, streamer->bbs_buffer.data,
> sizeof(ControlFileData));
>
> This probably doesn't really hurt anything, but it's a bit ugly. You
> first use astreamer_buffer_until() to force the entire file into a
> buffer. And then here, you copy the entire file into a second buffer
> which is exactly the same except that it's guaranteed to be properly
> aligned. It would be possible to include a ControlFileData in
> astreamer_verify and copy the bytes directly into it (you'd need a
> second astreamer_verify field for the number of bytes already copied
> into that structure). I'm not 100% sure that's worth the code, but it
> seems like it wouldn't take more than a few lines, so perhaps it is.
>
I think we could skip this memcpy() and directly cast
streamer->bbs_buffer.data to ControlFileData *, as we already ensure
that the correct length is being read just before this memcpy(). Did
the same in the attached version.

> +/*
> + * Verify plain backup.
> + */
> +static void
> +verify_plain_backup(verifier_context *context)
> +{
> + Assert(context->format == 'p');
> + verify_backup_directory(context, NULL, context->backup_directory);
> +}
> +
>
> This seems like a waste of space.
>
Yeah, but aim to keep the function name more self-explanatory and
consistent with the naming style.

> +verify_tar_backup(verifier_context *context)
>
> I like this a lot better now! I'm still not quite sure about the
> decision to have the ignore list apply to both the backup directory
> and the tar file contents -- but given the low participation on this
> thread, I don't think we have much chance of getting feedback from
> anyone else right now, so let's just do it the way you have it and we
> can change it later if someone shows up to complain.
>
Ok.

> +verify_all_tar_files(verifier_context *context, SimplePtrList *tarfiles)
>
> I think this code could be moved into its only caller instead of
> having a separate function. And then if you do that, maybe
> verify_one_tar_file could be renamed to just verify_tar_file. Or
> perhaps that function could also be removed and just move the code
> into the caller. It's not much code and not very deeply nested.
> Similarly create_archive_verifier() could be considered for this
> treatment. Maybe inlining all of these is too much and the result will
> look messy, but I think we should at least try to combine some of
> them.
>
I have removed verify_all_tar_files() and renamed
verify_one_tar_file() as suggested. However, I can't merge further
because I need verify_tar_file() (formerly verify_one_tar_file()) to
remain a separate function. This way, regardless of whether it
succeeds or encounters an error, I can easily perform cleanup
afterward.

On Tue, Sep 10, 2024 at 10:54 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> + pg_log_error("pg_waldump cannot read from a tar");
>
> "tar" isn't normally used as a noun as you do here, so I think this
> should say "pg_waldump cannot read tar files".
>
Done.

> Technically, the position of this check could lead to an unnecessary
> failure, if -n wasn't specified but pg_wal.tar(.whatever) also doesn't
> exist on disk. But I think it's OK to ignore that case.
>
> However, I also notice this change to the TAP tests in a few places:
>
> - [ 'pg_verifybackup', $tempdir ],
> + [ 'pg_verifybackup', '-Fp', $tempdir ],
>
> It's not the end of the world to have to make a change like this, but
> it seems easy to do better. Just postpone the call to
> find_backup_format() until right after we call parse_manifest_file().
> That also means postponing the check mentioned above until right after
> that, but that's fine: after parse_manifest_file() and then
> find_backup_format(), you can do if (!no_parse_wal && context.format
> == 't') { bail out }.
>
Done.

> + if (stat(path, &sb) == 0)
> + result = 'p';
> + else
> + {
> + if (errno != ENOENT)
> + {
> + pg_log_error("cannot determine backup format : %m");
> + pg_log_error_hint("Try \"%s --help\" for more
> information.", progname);
> + exit(1);
> + }
> +
> + /* Otherwise, it is assumed to be a tar format backup. */
> + result = 't';
> + }
>
> This doesn't look good, for a few reasons:
>
> 1. It would be clearer to structure this as if (stat(...) == 0) result
> = 'p'; else if (errno == ENOENT) result = 't'; else { report an error;
> } instead of the way you have it.
>
> 2. "cannot determine backup format" is not an appropriate way to
> report the failure of stat(). The appropriate message is "could not
> stat file \"%s\"".
>
> 3. It is almost never correct to put a space before a colon in an error message.
>
> 4. The hint doesn't look helpful, or necessary. I think you can just
> delete that.
>
> Regarding both point #2 and point #4, I think we should ask ourselves
> how stat() could realistically fail here. On my system (macOS), the
> document failed modes for stat() are EACCES (i.e. permission denied),
> EFAULT (i.e. we have a bug in pg_verifybackup), EIO (I/O Error), ELOOP
> (symlink loop), ENAMETOOLONG, ENOENT, ENOTDIR, and EOVERFLOW. In none
> of those cases does it seem likely that specifying the format manually
> will help anything. Thus, suggesting that the user look at the help,
> presumably to find --format, is unlikely to solve anything, and
> telling them that the error happened while trying to determine the
> backup format isn't really helping anything, either. What the user
> needs to know is that it was stat() that failed, and the pathname for
> which it failed. Then they need to sort out whatever problem is
> causing them to get one of those really weird errors.
>
Done.

> - of the backup. The backup must be stored in the "plain"
> - format; a "tar" format backup can be checked after extracting it.
> + of the backup. The backup must be stored in the "plain" or "tar"
> + format. Verification is supported for <literal>gzip</literal>,
> + <literal>lz4</literal>, and <literal>zstd</literal> compressed tar backup;
> + any other compressed format backups can be checked after decompressing them.
>
> I don't think that we need to say that the backup must be stored in
> the plain or tar format, because those are the only backup formats
> pg_basebackup knows about. Similarly, it doesn't seem help to me to
> enumerate all the compression algorithms that pg_basebackup supports
> and say we only support those; what else would a user expect?
>
> What I would do is replace the original sentence ("The backup must be
> stored...") with: The backup may be stored either in the "plain" or
> the "tar" format; this includes "tar" backups compressed with any
> algorithm supported by pg_basebackup. However, at present, WAL
> verification is supported only for plain-format backups. Therefore, if
> the backup is stored in "tar" format, the <literal>-n,
> --no-parse-wal<literal> option should be used.
>
Done

> + # Add tar backup format option
> + push @backup, ('-F', 't');
> + # Add switch to skip WAL verification.
> + push @verify, ('-n');
>
> Say why, not what. The second comment should say something like "WAL
> verification not yet supported for tar-format backups".
>
Done.

> + "$format backup fails with algorithm \"$algorithm\"");
> + $primary->command_ok(\(at)backup, "$format backup ok with
> algorithm \"$algorithm\"");
> + ok(-f "$backup_path/backup_manifest", "$format backup
> manifest exists");
> + "verify $format backup with algorithm \"$algorithm\"");
>
> Personally I would change "$format" to "$format format" in all of
> these places, so that we talk about a "tar format backup" or a "plain
> format backup" instead of a "tar backup" or a "plain backup".
>
Done.

> + 'skip_on_windows' => 1
>
> I don't understand why 4 of the 7 new tests are skipped on Windows.
> The existing "skip" message for this parameter says "unix-style
> permissions not supported on Windows" but that doesn't seem applicable
> for any of the new cases, and I couldn't find a comment about it,
> either.
>
I was a bit unsure whether Windows could handle unpacking and
repacking tar files and the required path formats for these tests but
the "Cirrus CI / Windows - Server 2019, VS 2019" workflow doesn’t have
any issues with them. I’ve removed the flag.

> + my @files = glob("*");
> + system_or_bail($tar, '-cf', "$backup_path/$archive", @files);
>
> Why not just system_or_bail($tar, '-cf', "$backup_path/$archive", '.')?
>
Doesn't suit since re-packing includes "./" at the beginning of each file path.

> Also, instead of having separate entries in the test array to do
> basically the same thing on Windows, could we just iterate through the
> test array twice and do everything once for plain format and then a
> second time for tar format, and do the tests once for each? I don't
> think that idea QUITE works, because the open_file_fails,
> open_directory_fails, and search_directory_fails tests are really not
> applicable to tar format. But we could rename skip_on_windows to
> tests_file_permissions and skip those both on Windows and for tar
> format. But aside from that, I don't quite see why it makes sense to,
> for example, test extra_file for both formats but not
> extra_tablespace_file, and indeed it seems like an important bit of
> test coverage.
>
Added test extra_file and missing_file test for tablespace as well.

> I also feel like we should have tests someplace that add extra files
> to a tar-format backup in the backup directory (e.g. 1234567.tar, or
> wug.tar, or 123456.wug) or remove entire files.
>
If I am not missing something, tar_backup_unexpected_file test does
that. I have added a test that removes the tablespace archive in the
attached version.

The updated version attached. Thank you for the review !

Regards,
Amul

Attachment Content-Type Size
v14-0006-Refactor-split-verify_backup_file-function-and-r.patch application/x-patch 6.3 KB
v14-0007-Refactor-split-verify_file_checksum-function.patch application/x-patch 3.2 KB
v14-0008-Refactor-split-verify_control_file.patch application/x-patch 5.4 KB
v14-0009-Add-simple_ptr_list_destroy-and-simple_ptr_list_.patch application/x-patch 1.7 KB
v14-0010-pg_verifybackup-Add-backup-format-and-compressio.patch application/x-patch 5.8 KB
v14-0011-pg_verifybackup-Read-tar-files-and-verify-its-co.patch application/x-patch 27.9 KB
v14-0012-pg_verifybackup-Tests-and-document.patch application/x-patch 24.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim Jones 2024-09-12 11:08:01 Re: Psql meta-command conninfo+
Previous Message Nazir Bilal Yavuz 2024-09-12 10:58:28 Re: PG_TEST_EXTRA and meson