Backup advice

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Eduardo Morras <emorrasg(at)yahoo(dot)es>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, "sthomas(at)optionshouse(dot)com" <sthomas(at)optionshouse(dot)com>
Subject: Backup advice
Date: 2013-04-16 02:54:15
Message-ID: CAMkU=1z0+=M-2g-N3+y=d-QJu-qcYTGXbw_h1E5g5Gr+rCny6A@mail.gmail.com
Lists: pgsql-general

On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras
<emorrasg(at)yahoo(dot)es> wrote:

> On Mon, 8 Apr 2013 10:40:16 -0500
> Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>
> >
> > Anyone else?
> >
>
> If his db has low inserts/updates/deletes he can use diff between pg_dumps
> (with default -Fp) before compressing.
>

Most "diff" implementations will read the entirety of both files into
memory, so may not work well with 200GB of data, unless it is broken into a
large number of much smaller files.
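
For illustration, one rough way around that (file and chunk names here are
made up) is to split each dump into line-based chunks and diff chunk by
chunk, which only helps if the dumps stay roughly aligned:

split -l 1000000 -d 1.sql 1_part_
split -l 1000000 -d 2.sql 2_part_
for p in 1_part_*; do
  # diff exits non-zero when the chunks differ, so keep the loop going
  diff "$p" "2_part_${p#1_part_}" > "diff_${p#1_part_}" || true
done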

open-vcdiff only reads one of the files into memory, but I couldn't really
figure out what happens memory-wise when you try to apply the resulting
patch; the documentation is a bit mysterious.

xdelta3 will "work" on streamed files of unlimited size, but it doesn't
work very well unless the files fit in memory, or have the analogous data
in the same order between the two files.
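
For reference, a minimal xdelta3 invocation along those lines might look
like this (taking 1.sql as the old dump and 2.sql as the new one; adjust
names to taste):

# encode: store only the differences of 2.sql against 1.sql
xdelta3 -e -s 1.sql 2.sql 2.sql.vcdiff
# decode: rebuild 2.sql from the old dump plus the delta
xdelta3 -d -s 1.sql 2.sql.vcdiff 2.sql.restored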

A while ago I made some attempts to "co-compress" dump files. The idea is
that the pg_dump text format has no \n within records, so it sorts as
ordinary text, and tables usually have their "stable" columns, like a pk,
near the beginning of each row and the volatile columns near the end.
Sorting the lines of several dump files together therefore gathers
duplicate or near-duplicate lines next to each other, where ordinary
compression algorithms can work their magic. So if you tag each line with
its line number and which file it originally came from, then sort the lines
(skipping the tag), you get much better compression than simply compressing
the concatenated dumps, although not nearly as good as open-vcdiff,
assuming you have the RAM to spare for it.

Using two dumps taken months apart on a slowly-changing database, it worked
fairly well:

cat 1.sql | pigz |wc -c
329833147

cat 2.sql | pigz |wc -c
353716759

cat 1.sql 2.sql | pigz |wc -c
683548147

sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
436350774

A given file can then be recovered, for example, with:

zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2
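
Recovering the first file is symmetric, just matching the 'a' tag instead
(again just a sketch):

zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+a\t)// and print' > 1.sql2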

There are all kinds of shortcomings here, of course; it was just a quick
and dirty proof of concept.

For now, I think storage is cheap enough for what I need to do that it
isn't worth fleshing this out any more.

Cheers,

Jeff
