From: | David Steele <david(at)pgmasters(dot)net> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: pg_combinebackup does not detect missing files |
Date: | 2024-04-22 00:47:10 |
Message-ID: | c5ac755e-5021-444d-b631-fac509e074c8@pgmasters.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 4/20/24 01:47, Robert Haas wrote:
> On Thu, Apr 18, 2024 at 7:36 PM David Steele <david(at)pgmasters(dot)net> wrote:
>> Sure -- pg_combinebackup would only need to verify the data that it
>> uses. I'm not suggesting that it should do an exhaustive verify of every
>> single backup in the chain. Though I can see how it sounded that way
>> since with pg_verifybackup that would pretty much be your only choice.
>>
>> The beauty of doing verification in pg_combinebackup is that it can do a
>> lot less than running pg_verifybackup against every backup but still get
>> a valid result. All we care about is that the output is correct -- if
>> there is corruption in an unused part of an earlier backup
>> pg_combinebackup doesn't need to care about that.
>>
>> As far as I can see, pg_combinebackup already checks most of the boxes.
>> The only thing I know that it can't do is detect missing files and that
>> doesn't seem like too big a thing to handle.
>
> Hmm, that's an interesting perspective. I've always been very
> skeptical of doing verification only around missing files and not
> anything else.
Yeah, me too. There should also be some verification of the file
contents themselves but now I can see we don't have that. For instance,
I can do something like this:
cp test/backup/incr1/base/1/3598 test/backup/incr1/base/1/2336
And pg_combinebackup runs without complaint. Maybe missing files are
more likely than corrupted files, but it would still be nice to check
for both.
> I figured that wouldn't be particularly meaningful, and
> that's pretty much the only kind of validation that's even
> theoretically possible without a bunch of extra overhead, since we
> compute checksums on entire files rather than, say, individual blocks.
> And you could really only do it for the final backup in the chain,
> because you should end up accessing all of those files, but the same
> is not true for the predecessor backups. So it's a very weak form of
> verification.
I don't think it is weak if you can verify that the output is exactly as
expected, i.e. all files are present and have the correct contents.
But in this case it would be nice to at least know that the source files
on disk are valid (which pg_verifybackup does). Without block checksums
it is hard to know if the final output is correct or not.
> But I looked into it and I think you're correct that, if you restrict
> the scope in the way that you suggest, we can do it without much
> additional code, or much additional run-time. The cost is basically
> that, instead of only looking for a backup_manifest entry when we
> think we can reuse its checksum, we need to do a lookup for every
> single file in the final input directory. Then, after processing all
> such files, we need to iterate over the hash table one more time and
> see what files were never touched. That seems like an acceptably low
> cost to me. So, here's a patch.
I tested the patch and it works, but there is some noise from WAL files
since they are not in the manifest:
$ pg_combinebackup test/backup/full test/backup/incr1 -o test/backup/combine
pg_combinebackup: warning: "pg_wal/000000010000000000000008" is present
on disk but not in the manifest
pg_combinebackup: error: "base/1/3596" is present in the manifest but
not on disk
Maybe it would be better to omit this warning since it could get very
noisy depending on the number of WAL segments generated during backup.
Though I do find it a bit odd that WAL files are not in the source
backup manifests but do end up in the manifest after a pg_combinebackup.
It doesn't seem harmful, just odd.
> I do think there's some chance that this will encourage people to
> believe that pg_combinebackup is better at finding problems than it
> really is or ever will be,
Given that pg_combinebackup is not verifying checksums, you are probably
right.
> and I also question whether it's right to
> keep changing stuff after feature freeze. But I have a feeling most
> people here are going to think this is worth including in 17. Let's
> see what others say.
I think it is a worthwhile change and we are still a month away from
beta1. We'll see if anyone disagrees.
Regards,
-David
From | Date | Subject | |
---|---|---|---|
Next Message | Richard Guo | 2024-04-22 02:36:13 | Assert failure in _bt_preprocess_array_keys |
Previous Message | Zhijie Hou (Fujitsu) | 2024-04-22 00:31:19 | RE: promotion related handling in pg_sync_replication_slots() |