Re: pg_verifybackup: TAR format backup verification

From: Amul Sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Sravan Kumar <sravanvcybage(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: pg_verifybackup: TAR format backup verification
Date: 2024-08-07 13:40:31
Message-ID: CAAJ_b975Pyo53P-zvkTPjVFErEANuRP9+K=2Qnifiy6BJBtwfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 6, 2024 at 10:39 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Aug 1, 2024 at 9:19 AM Amul Sul <sulamul(at)gmail(dot)com> wrote:
> > > I think I would have made this pass context->show_progress to
> > > progress_report() instead of the whole verifier_context, but that's an
> > > arguable stylistic choice, so I'll defer to you if you prefer it the
> > > way you have it. Other than that, this LGTM.
> >
> > Additionally, I moved total_size and done_size to verifier_context
> > because done_size needs to be accessed in astreamer_verify.c.
> > With this change, verifier_context is now more suitable.
>
> But it seems like 0006 now changes the logic for computing total_size.
> Prepatch, the condition is:
>
> - if (context->show_progress && !context->skip_checksums &&
> - should_verify_checksum(m))
> - context->total_size += m->size;
>
> where should_verify_checksum(m) checks (((m)->matched) && !((m)->bad)
> && (((m)->checksum_type) != CHECKSUM_TYPE_NONE)). But post-patch the
> condition is:
>
> + if (!context.skip_checksums)
> ...
> + if (!should_ignore_relpath(context, m->pathname))
> + total_size += m->size;
>
> The old logic was reached from verify_backup_directory() which does
> check should_ignore_relpath(), so the new condition hasn't added
> anything. But it seems to have lost the show_progress condition, and
> the m->checksum_type != CHECKSUM_TYPE_NONE condition. I think this
> means that we'll sum the sizes even when not displaying progress, and
> that if some of the files in the manifest had no checksums, our
> progress reporting would compute wrong percentages after the patch.
>

That is not true. The compute_total_size() function doesn't do
anything when not displaying progress, the first if condition, which
returns the same way as progress_report(). I omitted
should_verify_checksum() since we don't have match and bad flag
information at the start, and we won't have that for TAR files at all.
However, I missed the checksum_type check, which is necessary, and
have added it now.

With the patch, I am concerned that we won't be able to give an
accurate progress report as before. We add all the file sizes in the
backup manifest to the total_size without checking if they exist on
disk. Therefore, sometimes the reported progress completion might not
show 100% when we encounter files where m->bad or m->match == false at
a later stage. However, I think this should be acceptable since there
will be an error for the respective missing or bad file, and it can be
understood that verification is complete even if the progress isn't
100% in that case. Thoughts/Comments?

> > Understood. At the start of working on the v3 review, I thought of
> > completely discarding the 0007 patch and copying most of
> > verify_file_checksum() to a new function in astreamer_verify.c.
> > However, I later realized we could deduplicate some parts, so I split
> > verify_file_checksum() and moved the reusable part to a separate
> > function. Please have a look at v4-0007.
>
> Yeah, that seems OK.
>
> The fact that these patches don't have commit messages is making life
> more difficult for me than it needs to be. In particular, I'm looking
> at 0009 and there's no hint about why you want to do this. In fact
> that's the case for all of these refactoring patches. Instead of
> saying something like "tar format verification will want to verify the
> control file, but will not be able to read the file directly from
> disk, so separate the logic that reads the control file from the logic
> that verifies it" you just say what code you moved. Then I have to
> guess why you moved it, or flip back and forth between the refactoring
> patch and 0011 to try to figure it out. It would be nice if each of
> these refactoring patches contained a clear indication about the
> purpose of the refactoring in the commit message.
>

Sorry, I was a bit lazy there, assuming you'd handle the review :).
I can understand the frustration -- added some description.

> > I had the same thought about checking for NULL inside
> > should_verify_control_data(), but I wanted to maintain the structure
> > similar to should_verify_checksum(). Making this change would have
> > also required altering should_verify_checksum(), I wasn’t sure if I
> > should make that change before. Now, I did that in the attached
> > version -- 0008 patch.
>
> I believe there is no reason for this change to be part of 0008 at
> all, and that this should be part of whatever later patch needs it.
>

Ok

> > > Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also.
> >
> > Done.
>
> OK, the formatting of 0011 looks much better now.
>
> It seems to me that 0011 is arranging to palloc the checksum context
> for every file and then pfree it at the end. It seems like it would be
> considerably more efficient if astreamer_verify contained a
> pg_checksum_context instead of a pointer to a pg_checksum_context. If
> you need a flag to indicate whether we've reinitialized the checksum
> for the current file, it's better to add that than to have all of
> these unnecessary allocate/free cycles.
>

I tried in the attached version, and it’s a good improvement. We don’t
need any flags; we can allocate that during astreamer creation. Later,
in the ASTREAMER_MEMBER_HEADER case while reading, we can
(re)initialize the context for each file as needed.

> Existing astreamer code uses struct member names_like_this. For the
> new one, you mostly used namesLikeThis except when you used
> names_like_this or namesLkThs.
>

Yeah, in my patch, I ended up using the same name for both the
variable and the function. To avoid that, I made this change. This
could be a minor inconvenience for someone using ctags/cscope to find
the definition of the function or variable, as they might be directed
to the wrong place. However, I think it’s still okay since there are
ways to find the correct definition. I reverted those changes in the
attached version.

> It seems to me that instead of adding a global variable
> verify_backup_file_cb, it would be better to move the 'format'
> variable into verifier_context. Then you can do something like if
> (context->format == 'p') verify_plain_backup_file() else
> verify_tar_backup_file().
>

Done.

> It's pretty common for .tar.gz to be abbreviated to .tgz. I think we
> should support that.
>

Done.

> Let's suppose that I have a backup which, for some reason, does not
> use the same compression for all files (base.tar, 16384.tgz,
> 16385.tar.gz, 16366.tar.lz4). With this patch, that will fail. Now,
> that's not really a problem, because having a backup with mixed
> compression algorithms like that is strange and you probably wouldn't
> try to do it. But on the other hand, it looks to me like making the
> code support that would be more elegant than what you have now.
> Because, right now, you have code to detect what type of backup you've
> got by looking at base.WHATEVER_EXTENSION ... but then you have to
> also have code that complains if some later file doesn't have the same
> extension. But you could just detect the type of every file
> individually.
>
> In fact, I wonder if we even need -Z. What value is that actually
> providing? Why not just always auto-detect?
>

+1, removed -Z option.

> find_backup_format() ignores the possibility of stat() throwing an
> error. That's not good.
>

I wasn't sure about that before -- I tried it in the attached version.
See if it looks good to you.

> Suppose that the backup directory contains main.tar, 16385.tar, and
> snuffleupagus.tar. It looks to me like what will happen here is that
> we'll verify main.tar with tblspc_oid = InvalidOid, 16385.tar with
> tblspc_oid = 16385, and snuffleupagus.tar with tblspc_oid =
> InvalidOid. That doesn't sound right. I think we should either
> completely ignore snuffleupagus.tar just as it were completely
> imaginary, or perhaps there's an argument for emitting a warning
> saying that we weren't expecting a snuffleupagus to exist.
>
> In general, I think all unexpected files in a tar-format backup
> directory should get the same treatment, regardless of whether the
> problem is with the extension or the file itself. We should either
> silently ignore everything that isn't expected to be present, or we
> should emit a complaint saying that the file isn't expected to be
> present. Right now, you say that it's "not a valid file" if the
> extension isn't what you expect (which doesn't seem like a good error
> message, because the file may be perfectly valid for what it is, it's
> just not a file we're expecting to see) and say nothing if the
> extension is right but the part of the filename preceding the
> extension is unexpected.
>

I added an error for files other than base.tar and
<tablespacesoid>.tar. I think the error message could be improved.

> A related issue is that it's a little unclear what --ignore is
> supposed to do for tar-format backups. Does that ignore files in the
> backup directory, or files instead of the tar files inside of the
> backup directory? If we decide that --ignore ignores files in the
> backup directory, then we should complain about any unexpected files
> that are present there unless they've been ignored. If we decide that
> --ignore ignores files inside of the tar files, then I suggest we just
> silently skip any files in the backup directory that don't seem to
> have file names in the correct format. I think I prefer the latter
> approach, but I'm not 100% sure what's best.
>

I am interested in having that feature to be as useful as possible --
I mean, allowing the option to ignore files from the backup directory
and from the archive file as well. I don't see any major drawbacks,
apart from spending extra CPU cycles to browse the ignore list.

Regards,
Amul

Attachment Content-Type Size
v8-0010-pg_verifybackup-Add-backup-format-and-compression.patch application/x-patch 6.2 KB
v8-0009-Refactor-move-first-and-last-progress_report-call.patch application/x-patch 1.8 KB
v8-0008-Refactor-split-verify_control_file.patch application/x-patch 5.6 KB
v8-0011-pg_verifybackup-Read-tar-files-and-verify-its-con.patch application/x-patch 25.1 KB
v8-0012-pg_verifybackup-Tests-and-document.patch application/x-patch 12.5 KB
v8-0006-Refactor-split-verify_backup_file-function.patch application/x-patch 6.6 KB
v8-0007-Refactor-split-verify_file_checksum-function.patch application/x-patch 2.9 KB
v8-0004-Refactor-move-few-global-variable-to-verifier_con.patch application/x-patch 5.1 KB
v8-0005-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch application/x-patch 8.5 KB
v8-0001-Improve-file-header-comments-for-astramer-code.patch application/x-patch 5.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vitaly Davydov 2024-08-07 13:42:01 Re: Fsync (flush) all inserted WAL records
Previous Message Heikki Linnakangas 2024-08-07 12:55:33 Re: BlastRADIUS mitigation