Re: pg_combinebackup does not detect missing files

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup does not detect missing files
Date: 2024-04-18 14:50:10
Message-ID: CA+TgmoYX+mw48tTs=7-r-ZJiuiAGeHv4JhPeeQa4tJ8apZDjcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 17, 2024 at 7:09 PM David Steele <david(at)pgmasters(dot)net> wrote:
> I think here:
>
> + <application>pg_basebackup</application> only attempts to verify
>
> you mean:
>
> + <application>pg_combinebackup</application> only attempts to verify
>
> Otherwise this looks good to me.

Good catch, thanks. Committed with that change.

> Fair enough. I accept that your reasoning is not random, but I'm still
> not very satisfied that the user needs to run a separate and rather
> expensive process to do the verification when pg_combinebackup already
> has the necessary information at hand. My guess is that most users will
> elect to skip verification.

I think you're probably right that a lot of people will skip it; I'm
just less convinced than you are that it's a bad thing. It's not a
*great* thing if people skip it, but restore time is actually just
about the worst time to find out that you have a problem with your
backups. I think users would be better served by verifying stored
backups periodically when they *don't* need to restore them. Also,
saying that we have all of the information that we need to do the
verification is only partially true:

- we do have to parse the manifest anyway, but we don't have to
compute checksums anyway, and I think that cost can be significant
even for CRC-32C and much more significant for any of the SHA variants

- we don't need to read all of the files in all of the backups. if
there's a newer full, the corresponding file in older backups, whether
full or incremental, need not be read

- incremental files other than the most recent only need to be read to
the extent that we need their data; if some of the same blocks have
been changed again, we can economize

How much you save because of these effects is pretty variable. Best
case, you have a 2-backup chain with no manifest checksums, and all
verification will have to do that you wouldn't otherwise need to do is
walk each older directory tree in toto and cross-check which files
exist against the manifest. That's probably cheap enough that nobody
would be too fussed. Worst case, you have a 10-backup (or whatever)
chain with SHA512 checksums and, say, a 50% turnover rate. In that
case, I think having verification happen automatically could be a
pretty major hit, both in terms of I/O and CPU. If your database is
1TB, it's ~5.5TB of read I/O (because one 1TB full backup and 9 0.5TB
incrementals) instead of ~1TB of read I/O, plus the checksumming.

Now, obviously you can still feel that it's totally worth it, or that
someone in that situation shouldn't even be using incremental backups,
and it's a value judgement, so fair enough. But my guess is that the
efforts that this implementation makes to minimize the amount of I/O
required for a restore are going to be important for a lot of people.

> At least now they'll have the information they need to make an informed
> choice.

Right.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-04-18 14:57:18 Re: improve performance of pg_dump --binary-upgrade
Previous Message Alvaro Herrera 2024-04-18 14:08:58 Re: pgsql: Fix restore of not-null constraints with inheritance