| From: | David Steele <david(at)pgmasters(dot)net> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> | 
| Subject: | Re: pg_combinebackup fails on file named INCREMENTAL.* | 
| Date: | 2024-04-16 02:12:10 | 
| Message-ID: | 44b65bfe-847e-4710-847f-10585cc74eba@pgmasters.net | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On 4/16/24 06:33, Robert Haas wrote:
> On Sun, Apr 14, 2024 at 11:53 PM David Steele <david(at)pgmasters(dot)net> wrote:
>> Since incremental backup is using INCREMENTAL as a keyword (rather than
>> storing incremental info in the manifest) it is vulnerable to any file
>> in PGDATA with the pattern INCREMENTAL.*.
> 
> Yeah, that's true. I'm not greatly concerned about this, because I
> think anyone who creates a file called INCREMENTAL.CONFIG is just
> being perverse. 
Well, it's INCREMENTAL.* and you might be surprised (though I doubt it) 
at what users will do. We've been caught out by wacky file names (and 
database names) a few times.
> However, we could ignore those files just in subtrees
> where they're not expected to occur i.e. only pay attention to them
> under base, global, and pg_tblspc. I didn't do this because I thought
> someone might eventually want to do something like incremental backup
> of SLRU files, and then it would be annoying if there were a bunch of
> random pathname restrictions. But if we're really concerned about
> this, I think it would be a reasonable fix; you're not really supposed
> to decide to store your configuration files in directories that exist
> for the purpose of storing relation data files.
I think it would be reasonable to restrict what can be put in base/ and 
global/ but users generally feel free to create whatever they want in 
the root of PGDATA, despite being strongly encouraged not to.
Anyway, I think it should be fixed or documented as a caveat since it 
causes a hard failure on restore.
>> We could fix the issue by forbidding this file pattern in PGDATA, i.e.
>> error when it is detected during pg_basebackup, but in my view it would
>> be better (at least eventually) to add incremental info to the manifest.
>> That would also allow us to skip storing zero-length files and
>> incremental stubs (with no changes) as files.
> 
> I did think about this, and I do like the idea of being able to elide
> incremental stubs. If you have a tremendous number of relations and
> very few of them have changed at all, the incremental stubs might take
> up a significant percentage of the space consumed by the incremental
> backup, which kind of sucks, even if the incremental backup is still
> quite small compared to the original database. And, like you, I had
> the idea of trying to use the backup_manifest to do it.
> 
> But ... I didn't really end up feeling very comfortable with it. Right
> now, the backup manifest is something we only use to verify the
> integrity of the backup. If we were to do this, it would become a
> critical part of the backup. 
For my 2c the manifest is absolutely a critical part of the backup. I'm 
having a hard time even seeing why we have the --no-manifest option for 
pg_combinebackup at all. If the manifest is missing who knows what else 
might be missing as well? If present, why wouldn't we use it?
I know Tomas added some optimizations that work best with --no-manifest 
but if we can eventually read compressed tars (which I expect to be the 
general case) then those optimizations are not very useful.
> I don't think I like the idea that
> removing the backup_manifest should be allowed to, in effect, corrupt
> the backup. But I think I dislike even more the idea that the data
> that is used to verify the backup gets mushed together with the backup
> data itself. Maybe in practice it's fine, but it doesn't seem very
> conceptually clean.
I don't think this is a problem. The manifest can do more than store 
verification info, IMO.
> There are also some practical considerations that made me not want to
> go in this direction: we'd need to make sure that the relevant
> information from the backup manifest was available to the
> reconstruction process at the right part of the code. When iterating
> over a directory, it would need to consider all of the files actually
> present in that directory plus any "hallucinated" incremental stubs
> from the manifest. I didn't feel confident of my ability to implement
> that in the available time without messing anything up.
I think it would be better to iterate over the manifest than files in a 
directory. In any case we still need to fix [1], which presents more or 
less the same problem.
> I think we should consider other possible designs here. For example,
> instead of including this in the manifest, we could ship one
> INCREMENTAL.STUBS file per directory that contains a list of all of
> the incremental stubs that should be created in that directory. My
> guess is that something like this will turn out to be way simpler to
> code.
So we'd store a mini manifest per directory rather than just put the 
info in the main manifest? Sounds like unnecessary complexity to me.
Regards,
-David
[1] 
https://www.postgresql.org/message-id/flat/9badd24d-5bd9-4c35-ba85-4c38a2feb73e%40pgmasters.net
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Hayato Kuroda (Fujitsu) | 2024-04-16 02:18:29 | RE: Slow catchup of 2PC (twophase) transactions on replica in LR | 
| Previous Message | David Rowley | 2024-04-16 02:04:22 | Re: Stability of queryid in minor versions |