Re: pg_combinebackup --incremental

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --incremental
Date: 2024-11-13 13:56:59
Message-ID: CAKZiRmyarJqnw126a9v0A8ZEYs3JQ=1tMLKRVffJEHb8CoeCBw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2024 at 6:53 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

Hi Robert,

[..]

1. Well, I have also hit the same bug as Bertrand, which seems to be because
macOS was used for development rather than Linux (and macOS doesn't have
copy_file_range(2)/HAVE_COPY_FILE_RANGE) --> I've simply fallen back to
undefining HAVE_COPY_FILE_RANGE in my case, but the patch needs to be fixed.
I haven't run any longer or more data-intensive tests, as copy_file_range()
support seems to be missing and from my point of view that thing is crucial.

2. While interleaving several incremental backups with pgbench, I've
noticed something strange by accident:

This will work:

$ pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1 incr2 incr3 incr4

This will also work (new):

$ pg_combinebackup -i incr1 incr2 incr3 incr4 -o good_incr1_4
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup good_incr1_4

This will also work (new):

$ pg_combinebackup -i incr1 incr2 -o incr_12 #ok
$ pg_combinebackup -i incr_12 incr3 -o incr_13 #ok
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr_13

BUT, if I make an intentional user error and merge the same incr2 into
two different sets of incremental backups, it won't reconstruct that:

$ pg_combinebackup -i incr1 incr2 -o incr1_2 # contains 1 + 2
$ pg_combinebackup -i incr2 incr3 -o incr2_3 # contains 2(!) + 3
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 # ofc works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr2_3 # fails?
pg_combinebackup: error: backup at "incr1_2" starts at LSN 0/B000028, but expected 0/9000028

It seems to be taking LSN from incr1_2 and ignoring incr2_3 ?

$ find incr1 incr2 incr3 incr1_2 incr2_3 fulldbbackup -name backup_manifest -exec grep -H LSN {} \;
incr1/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/9000028", "End-LSN": "0/9000120" }
incr2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B000028", "End-LSN": "0/B000120" }
incr3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D000028", "End-LSN": "0/D000120" }
incr1_2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B000028", "End-LSN": "0/B000120" }
incr2_3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D000028", "End-LSN": "0/D000120" }
fulldbbackup/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/70000D8", "End-LSN": "0/70001D0" }

So I'm not sure whether we should cover that scenario or not?

$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr3 # works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr3_4 # works
$ rm -rf /var/lib/postgresql/18/main && pg_combinebackup -o /var/lib/postgresql/18/main fulldbbackup incr1_2 incr3_4 # two combined sets - also work

4. The space-saving feature seems to be there (I've tried merging up to ~40
backups, with a rolling merge of the incremental backups after each new
incremental backup), which seems to be the primary objective of the patch:

$ du -sm incr? incr?? incr1_38
38 incr1
25 incr2
[..]
24 incr37
24 incr38
[..above would be ~38*~25M = ~950MB]
87 incr1_38 # instead of ~950MB just 87MB

5. I've accidentally run into an independent problem when using
"pg_combinebackup -i `ls -1vd incr*` -o incr_ALL" (when dealing with dozens
of incrementals being merged, I bet this is going to be a commonly used
pattern): pg_combinebackup was failing with:

$ pg_combinebackup -i `ls -1vd incr*` -o incr_ALL
pg_combinebackup: error: incr26/global/pg_control: expected system identifier 7436752340350991865, but found 7436753510269539237

The issue for me is that the check that the output directory does not exist
should come first: here ls(1) picks up incr_ALL as well, so pg_combinebackup
goes looking at System-Identifiers and blows up with the error above before
checking whether the -o directory already exists:

$ grep System-Id ./incr_ALL/backup_manifest
"System-Identifier": 7436752340350991865,

So the issue is sequencing: first it should check if the incr_ALL does not
exist and only maybe later start inspecting backups to be combined?
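
Until the sequencing is sorted out, a user-side guard can approximate the
order proposed above. This is only a minimal sketch, not part of the patch:
the combine_incrementals function and the PG_COMBINEBACKUP override variable
are hypothetical names made up for illustration, and -i is the switch from
the patch under discussion.

```shell
# Hypothetical wrapper: refuse to run if the output directory already
# exists, and only then hand the inputs over to pg_combinebackup.
# Usage: combine_incrementals OUTPUT_DIR INCR_DIR...
combine_incrementals() {
    out=$1
    shift
    if [ -e "$out" ]; then
        echo "error: output directory \"$out\" already exists" >&2
        return 1
    fi
    # PG_COMBINEBACKUP is an assumed override, handy for dry runs.
    "${PG_COMBINEBACKUP:-pg_combinebackup}" -i "$@" -o "$out"
}
```

Calling it as combine_incrementals incr_ALL `ls -1vd incr[0-9]*` also keeps
incr_ALL itself out of the input list, because the narrower glob only matches
directories whose name continues with a digit.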

6. Not sure if that's critical, but it seems to require the incremental
backups to be listed in order for the merge to work correctly; is that a
limitation by design or not? (note the reverse sort, -r, in the second ls(1)):

$ rm -rf incr_ALL && pg_combinebackup -i `ls -1vd incr*` -o incr_ALL
$ rm -rf incr_ALL && pg_combinebackup -i `ls -1rvd incr*` -o incr_ALL
pg_combinebackup: error: backup at "incr2" starts at LSN 0/B000028, but expected 0/70000D8

simpler:
$ rm -rf incr_ALL && pg_combinebackup -i incr1 incr2 incr3 -o incr_ALL
$ rm -rf incr_ALL && pg_combinebackup -i incr3 incr2 incr1 -o incr_ALL
pg_combinebackup: error: backup at "incr2" starts at LSN 0/B000028, but expected 0/70000D8
$ find incr1 incr2 incr3 -name backup_manifest -exec grep -H LSN {} \; | sort -nk 1
incr1/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/9000028", "End-LSN": "0/9000120" }
incr2/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/B000028", "End-LSN": "0/B000120" }
incr3/backup_manifest:{ "Timeline": 1, "Start-LSN": "0/D000028", "End-LSN": "0/D000120" }
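
If the ordering requirement stays, the order can be derived from the
manifests rather than from directory names. A rough sketch follows, assuming
each backup_manifest carries its "Start-LSN" on a single line as in the grep
output above and that the first such line is the relevant one; lsn_to_dec and
sort_by_start_lsn are hypothetical helper names, not anything shipped with
PostgreSQL.

```shell
# Convert an LSN such as 0/9000028 (two hex halves) into one decimal
# number, so that plain numeric sorting orders LSNs correctly.
lsn_to_dec() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Print the given backup directories ordered by the first Start-LSN
# found in each directory's backup_manifest.
sort_by_start_lsn() {
    for dir in "$@"; do
        lsn=$(sed -n 's/.*"Start-LSN": *"\([0-9A-Fa-f]*\/[0-9A-Fa-f]*\)".*/\1/p' \
              "$dir/backup_manifest" | head -n 1)
        printf '%s %s\n' "$(lsn_to_dec "$lsn")" "$dir"
    done | sort -n | cut -d' ' -f2-
}
```

With the three backups above, sort_by_start_lsn incr3 incr2 incr1 should
print incr1, incr2, incr3 (one per line), which can then be passed to
pg_combinebackup -i in that order. Directory names containing spaces or
newlines are not handled by this sketch.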

Nitpicking and other possibly not important things:

a. I'm still a fan of `--merge-incremental[-backups]` over `--incremental`
switch in pg_combinebackup and disabling the short `-i` switch :^)

b. pg_combinebackup help message has:
> -i, --incremental combine incrementals without a full backup
Maybe s/combine incrementals/merge incremental backups/, as the bare
"incrementals" leaves out the "incremental of <what>"

c. While we are at copy_file_blocks(), couldn't we simply also report
strerror(errno) as one of the parameters to pg_fatal() on a short write? I
bet an ENOSPC error message would be less vague than this:

pg_combinebackup: error: could not write to file "incr1_39/base/5/2613", offset 9011200: wrote 327680 of 409600
pg_combinebackup: removing output directory "incr1_39"

-J.
