Re: Backup advice

From: Eduardo Morras <emorrasg(at)yahoo(dot)es>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Backup advice
Date: 2013-04-16 11:50:13
Message-ID: 20130416135013.94873e45a8ee1365f28d2554@yahoo.es
Lists: pgsql-general

On Mon, 15 Apr 2013 19:54:15 -0700
Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras
> <emorrasg(at)yahoo(dot)es> wrote:
>
> > On Mon, 8 Apr 2013 10:40:16 -0500
> > Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> >
> > >
> > > Anyone else?
> > >
> >
> > If his db has low inserts/updates/deletes he can use diff between pg_dumps
> > (with default -Fp) before compressing.
> >
>
> Most "diff" implementations will read the entirety of both files into
> memory, so may not work well with 200GB of data, unless it is broken into a
> large number of much smaller files.
>
> open-vcdiff only reads one of the files into memory, but I couldn't really
> figure out what happens memory-wise when you try to undo the resulting
> patch, the documentation is a bit mysterious.
>
> xdelta3 will "work" on streamed files of unlimited size, but it doesn't
> work very well unless the files fit in memory, or have the analogous data
> in the same order between the two files.

For my 12-13 GB dump files I use:

git diff -p 1.sql 2.sql > diff.patch

It uses 4 MB in the first phase and up to 140 MB in the last one, and produces a patch file from which 2.sql can be recovered with:

patch -o 2.sql 1.sql < diff.patch

or using git apply.
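
The whole cycle looks something like this (just a sketch; the file names, database name and gzip level are only examples, adapt to taste):

# take today's plain-format dump (diff-friendly: no newlines inside rows)
pg_dump -Fp -f dump_today.sql mydb

# store only the delta against the previous dump
# (--no-index lets git diff compare files outside a repository;
#  git diff exits with status 1 when the files differ, so don't treat that as an error)
git diff --no-index -p dump_prev.sql dump_today.sql > delta.patch
gzip -9 delta.patch

# later, rebuild the newer dump from the older one plus the delta:
#   zcat delta.patch.gz | patch -o dump_today.sql dump_prev.sql

mv dump_today.sql dump_prev.sql   # rotate for the next run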

> A while ago I did some attempts to "co-compress" dump files, based on the
> notion that the pg_dump text format does not have \n within records so it
> is sortable as ordinary text, and that usually tables have their "stable"
> columns, like a pk, near the beginning of the table and volatile columns
> near the end, so that sorting the lines of several dump files together will
> gather replicate or near-replicate lines together where ordinary
> compression algorithms can work their magic. So if you tag each line with
> its line number and which file it originally came from, then sort the lines
> (skipping the tag), you get much better compression. But not nearly as
> good as open-vcdiff, assuming you have the RAM to spare.
>
> Using two dumps taken months apart on a slowly-changing database, it worked
> fairly well:
>
> cat 1.sql | pigz |wc -c
> 329833147
>
> cat 2.sql | pigz |wc -c
> 353716759
>
> cat 1.sql 2.sql | pigz |wc -c
> 683548147
>
> sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print
> "${.}b\t$_"' 2.sql) | pigz |wc -c
> 436350774
>
> A certain file could be recovered by, for example:
>
> zcat group_compressed.gz |sort -n|perl -lne 's/^(\d+b\t)// and print' >
> 2.sql2

Be careful: some z* utilities (zdiff, for example) decompress the whole file to /tmp.
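
If you want to diff the compressed dumps directly, bash process substitution avoids those temporary copies (the file names here are just examples, and your note about diff reading both files into memory still applies):

diff <(zcat 1.sql.gz) <(zcat 2.sql.gz) > diff.patch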

> There are all kinds of shortcomings here, of course, it was just a quick and
> dirty proof of concept.

A nice one!

> For now I think storage is cheap enough for what I need to do to make this
> not worth fleshing it out any more.
>
> Cheers,
>
> Jeff

--- ---
Eduardo Morras <emorrasg(at)yahoo(dot)es>
