From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Implementing incremental backup
Date: 2013-06-20 00:04:12
Message-ID: CAGTBQpbd5qXoTqFBRKvY0x41LQwF3uTYMPtGHuc28vtXVwCsFg@mail.gmail.com
Lists: pgsql-hackers
On Wed, Jun 19, 2013 at 8:40 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>> On Wed, Jun 19, 2013 at 6:20 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>>> * Claudio Freire (klaussfreire(at)gmail(dot)com) wrote:
>>>> I don't see how this is better than snapshotting at the filesystem
>>>> level. I have no experience with TB scale databases (I've been limited
>>>> to only hundreds of GB), but from my limited mid-size db experience,
>>>> filesystem snapshotting is pretty much the same thing you propose
>>>> there (xfs_freeze), and it works pretty well. There are even automated
>>>> tools to do that, like bacula, and they can handle incremental
>>>> snapshots.
>>>
>>> Large databases tend to have multiple filesystems and getting a single,
>>> consistent, snapshot across all of them while under load is..
>>> 'challenging'. It's fine if you use pg_start/stop_backup() and you're
>>> saving the XLOGs off, but if you can't do that..
>>
>> Good point there.
>>
>> I still don't like the idea of having to mark each modified page. The
>> WAL compressor idea sounds a lot more workable. As in scalable.
>
> Why do you think the WAL compressor idea is more scalable? I really
> want to know why. Besides the unlogged tables issue, I can accept the
> idea if a WAL-based solution is much more efficient. If there's no
> perfect, ideal solution, we need to prioritize things. My #1 priority
> is being able to create an incremental backup of a TB-scale database,
> with a backup file that is small enough and a creation time that is
> acceptable. I just don't know why scanning the WAL stream is much
> cheaper than recording modified page information.
Because it aggregates updates.
When you work at the storage manager level, you only see block-sized
operations. This results in the need to WAL-log bit-sized updates to
some hypothetical dirty-map index. Even when done 100% efficiently,
that implies at least one extra write per dirtied block, which could
as much as double write I/O in the worst (and entirely expected) case.
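To put rough numbers on that, here's a toy Python model (nothing like
actual PostgreSQL code; NaiveDirtyMap and everything in it is made up
for illustration) of a dirty map that durably records one entry per
dirtied block:

    # Toy model, NOT PostgreSQL code: every data-block write forces a
    # durable dirty-map update, so writes roughly double when each
    # block is dirtied once between flushes.

    class NaiveDirtyMap:
        def __init__(self):
            self.block_writes = 0   # writes of actual data blocks
            self.map_writes = 0     # extra writes persisting map entries
            self.dirty = set()

        def write_block(self, rel, blockno):
            self.block_writes += 1
            self.dirty.add((rel, blockno))
            self.map_writes += 1    # map entry made durable alongside

    m = NaiveDirtyMap()
    for blockno in range(1000):     # 1000 distinct blocks dirtied
        m.write_block("some_table", blockno)
    print(m.block_writes, m.map_writes)   # 1000 1000 -> 2x write I/O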
When you do it at WAL segment recycle time, or better yet during
checkpoints, you deal with checkpoint-scale operations. If you keep a
dirty map, you can aggregate its updates, which not only reduces I/O
considerably (through a much higher likelihood of write coalescing),
but also lets you schedule it better (toss it in the background, with
checkpoints). That only covers gathering dirty-map updates, though,
and it still leaves you with the complex problem of then actually
snapshotting those pages consistently without interfering with
ongoing transactions.
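The aggregation half looks like this in the same toy model (again,
made-up names, only meant to show the coalescing):

    # Toy model, NOT PostgreSQL code: dirty bits are buffered in
    # memory and flushed once per checkpoint, so repeated writes to a
    # hot block coalesce into one map entry and one background write.

    class CheckpointDirtyMap:
        def __init__(self):
            self.pending = set()    # in-memory; duplicates coalesce
            self.flush_writes = 0

        def write_block(self, rel, blockno):
            self.pending.add((rel, blockno))    # no extra I/O here

        def checkpoint(self):
            self.flush_writes += 1  # one write of the accumulated map
            flushed, self.pending = self.pending, set()
            return flushed

    m = CheckpointDirtyMap()
    for _ in range(10):             # same hot block dirtied 10 times
        m.write_block("some_table", 42)
    print(len(m.checkpoint()), m.flush_writes)   # 1 entry, 1 flush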
If you do a WAL compressor instead, WAL entries are write-once, so
you'll have no trouble snapshotting those pages. You have the
checkpoint's initial full-page write, so you don't even have to read
the page, and you can accumulate all further partial writes into one
full-page write and dump that into an "incremental archive". So you
get all the I/O aggregation from above, which bounds the extra I/O to
at worst a doubling of WAL I/O. It's bound by a constant, and in
contrast to dirty-map updates, it's sequential I/O, so it's a lot
faster. It's thus perfectly scalable.
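Roughly, the compressor loop would look like this toy Python sketch
(compress_wal and the record format are invented for illustration;
the real WAL format is nothing this simple):

    # Toy model of the WAL-compressor idea: the checkpoint's full-page
    # write seeds each page image, later partial writes are folded in,
    # and one image per touched page is emitted per checkpoint as the
    # "incremental archive".

    def compress_wal(records):
        """records: iterable of (page_id, kind, payload) in WAL order."""
        images = {}
        for page_id, kind, payload in records:
            if kind == "full_page":   # checkpoint FPW: no page read needed
                images[page_id] = payload
            else:                     # fold a partial write into the image
                offset, data = payload
                img = bytearray(images[page_id])
                img[offset:offset + len(data)] = data
                images[page_id] = bytes(img)
        return images                 # one image per dirtied page

    wal = [
        (7, "full_page", b"\x00" * 8192),
        (7, "update", (0, b"hello")),
        (7, "update", (5, b" world")),
    ]
    archive = compress_wal(wal)
    print(len(archive), archive[7][:11])   # 1 page, b'hello world'

The point being that the compressor never has to read pages from the
cluster: everything it needs is already in the WAL stream.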
Not only that, but you're also amortizing incremental backup costs
over time, since you're making them continuously rather than at
regular intervals. You'll have one incremental backup per checkpoint.
If you want to coalesce backups, you launch another compressor pass
to merge the last incremental backup with the new one. And, as the
cherry on top, you only have to do this on the archived WALs, which
means you could very well do it on another system, freeing your main
cluster from all this I/O. It's thus perfectly scalable.
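Coalescing is trivial in the same toy representation (hypothetical
merge_incrementals, assuming each archive maps page id to its latest
full-page image for that checkpoint interval):

    # Toy model of coalescing two incremental backups: the newer
    # archive's page images simply supersede the older one's. This can
    # run on the archive host, off the main cluster.

    def merge_incrementals(older, newer):
        merged = dict(older)
        merged.update(newer)    # newer full-page images win
        return merged

    inc1 = {7: b"A" * 8192, 9: b"B" * 8192}
    inc2 = {7: b"C" * 8192}                   # page 7 rewritten later
    merged = merge_incrementals(inc1, inc2)
    print(sorted(merged), merged[7][:1])      # [7, 9] b'C'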
The only bottleneck here is WAL archiving. This assumes you can afford
WAL archiving at least to a local filesystem, and that the WAL
compressor is able to cope with the WAL bandwidth. But I have no
reason to think you'd be able to cope with dirty-map updates anyway
if you were saturating the WAL compressor, since the compressor has a
lower amortized cost per transaction than the dirty-map approach.