From: | Marco Nenciarini <marco(dot)nenciarini(at)2ndquadrant(dot)it> |
---|---|
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Proposal: Incremental Backup |
Date: | 2014-07-29 16:35:51 |
Message-ID: | 53D7CD67.6000202@2ndquadrant.it |
Lists: | pgsql-hackers |
On 25/07/14 20:44, Robert Haas wrote:
> On Fri, Jul 25, 2014 at 2:21 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
>> <marco(dot)nenciarini(at)2ndquadrant(dot)it> wrote:
>>> 1. Proposal
>>> =================================
>>> Our proposal is to introduce the concept of a backup profile. The backup
>>> profile consists of a file with one line per file detailing tablespace,
>>> path, modification time, size and checksum.
>>> Using that file, the BASE_BACKUP command can decide which files need
>>> to be sent again and which are unchanged. The algorithm should be very
>>> similar to rsync, but since our files are never bigger than 1 GB per
>>> file that is probably granular enough not to worry about copying parts
>>> of files, just whole files.
>>
>> That wouldn't be nearly as useful as the LSN-based approach mentioned before.
>>
>> I've had my share of rsyncing live databases (when resizing
>> filesystems, not for backup, but the anecdotal evidence applies
>> anyhow) and with moderately write-heavy databases, even if you only
>> modify a tiny portion of the records, you end up modifying a huge
>> portion of the segments, because the free space choice is random.
>>
>> There have been patches going around to change the random nature of
>> that choice, but none are very likely to make a huge difference for
>> this application. In essence, file-level comparisons get you only a
>> mild speed-up, and are not worth the effort.
>>
>> I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
>> the I/O of inspecting the LSN of entire segments (necessary
>> optimization for huge multi-TB databases) and backs up only the
>> portions modified when segments do contain changes, so it's the best
>> of both worlds. Any partial implementation would either require lots
>> of I/O (LSN only) or save very little (file only) unless it's an
>> almost read-only database.
>
> I agree with much of that. However, I'd question whether we can
> really seriously expect to rely on file modification times for
> critical data-integrity operations. I wouldn't like it if somebody
> ran ntpdate to fix the time while the base backup was running, and it
> set the time backward, and the next differential backup consequently
> omitted some blocks that had been modified during the base backup.
>
Our proposal doesn't rely on file modification times for data integrity.
We use the file mtime only as a fast indication that the file has
changed; in that case it is transferred again without computing the
checksum. If timestamp and size match, we rely on *checksums* to decide
whether it has to be sent.
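A minimal sketch of the per-file decision described above (Python,
purely illustrative; names such as ProfileEntry and needs_transfer are
not part of the proposal):

    from collections import namedtuple

    # One backup profile line: tablespace, path, mtime, size, checksum.
    ProfileEntry = namedtuple(
        "ProfileEntry", ["tablespace", "path", "mtime", "size", "checksum"]
    )

    def needs_transfer(previous, current):
        """Decide whether a file must be sent again.

        'previous' is the entry from the last backup profile, 'current'
        is the entry computed for the same file now.
        """
        if previous is None:
            # File was not present in the previous backup: send it.
            return True
        if (current.mtime != previous.mtime
                or current.size != previous.size):
            # mtime/size are only a fast "it changed" hint: resend the
            # file without bothering to compare checksums.
            return True
        # Timestamp and size match: fall back to the checksum.
        return current.checksum != previous.checksum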
In "SMART MODE" we would use the file mtime to skip the checksum check
in some cases, but it wouldn't be the default operation mode and it will
have all the necessary warnings attached. However the "SMART MODE" isn't
a core part of our proposal, and can be delayed until we agree on the
safest way to bring it to the end user.
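As a purely hypothetical illustration of the difference (again, not part
of the core proposal), the smart-mode shortcut would simply trust a
matching timestamp and size instead of falling back to the checksum:

    def needs_transfer_smart(previous, current):
        """SMART MODE variant: trust mtime+size, skip the checksum.

        Faster, but unsafe if the clock moves backwards while files
        are being modified -- hence the warnings mentioned above.
        """
        if previous is None:
            return True
        return (current.mtime != previous.mtime
                or current.size != previous.size)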
Regards,
Marco
--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco(dot)nenciarini(at)2ndQuadrant(dot)it | www.2ndQuadrant.it