Re: Using pg_start_backup() and pg_stop_backup()

From: David B Harris <dbharris(at)eelf(dot)ddts(dot)net>
To: PostgreSQL mailing lists <pgsql-general(at)postgresql(dot)org>
Subject: Re: Using pg_start_backup() and pg_stop_backup()
Date: 2013-07-17 00:35:49
Message-ID: 20130717003549.GJ24198@valiant.tachsys.net
Lists: pgsql-general

On Wed Jul 17, 09:16am +0900, Michael Paquier wrote:
> On Wed, Jul 17, 2013 at 8:24 AM, David B Harris <dbharris(at)eelf(dot)ddts(dot)net> wrote:
> > On Wed Jul 17, 08:12am +0900, Michael Paquier wrote:
> > 4. In parallel, WAL archiving has copied all the logs from the
> > start of pg_start_backup() through to the end of pg_stop_backup()
> Yes, exactly. This avoids having to maintain your own scripts and lets
> you rely on the server features... What do you actually do to copy the
> necessary WAL files? Do you fetch them directly from the master's
> pg_xlog folder?

Due to bandwidth constraints, separate scripts are going to need to be
maintained anyway. I'd love to use pg_basebackup, but since it transfers
the entire cluster each time, it's not suitable; something like rsync is
going to be used instead. (Using pg_basebackup would require a separate
script anyway, so a custom script is needed regardless.) If the WAL
files do need to be copied, I'll be copying them separately, yes.
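Roughly, the rsync-based flow I have in mind is the following. (This is
a sketch, not the actual script; "backuphost" and the /backups paths are
placeholders, and $WAL_SPOOL is a local directory that a companion
archive_command would fill with WAL segments during the backup window.)

```shell
#!/bin/sh
# Sketch of an rsync-based base backup driver. PGDATA and WAL_SPOOL come
# from the environment; the destination host and paths are made up.
set -eu

run_base_backup() {
    # Enter backup mode; from here on archive_command spools WAL.
    psql -qAt -c "SELECT pg_start_backup('rsync backup')"

    # Base copy of the cluster. pg_xlog/ is excluded because the WAL
    # needed for recovery travels through the spool instead.
    rsync -a --delete --exclude=pg_xlog/ "$PGDATA/" "backuphost:/backups/base/"

    # Leave backup mode (this also forces a segment switch, so the last
    # needed WAL file gets handed to archive_command).
    psql -qAt -c "SELECT pg_stop_backup()"

    # Ship the WAL spooled during the window, then clean up locally.
    rsync -a "$WAL_SPOOL/" "backuphost:/backups/wal/"
    rm -f "$WAL_SPOOL"/* 2>/dev/null || true
}
```

The point of the ordering is that pg_stop_backup() runs before the spool
is shipped, so the final segment is already in the spool when the second
rsync fires.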

(I'll use an archive_command that only runs when backup_label is in
place [i.e. pg_start_backup() has been called], put the segments
somewhere safe during the cluster rsync [probably outside of the cluster
directory], transfer them after pg_stop_backup(), and then delete them
on the source/master server.)
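Concretely, that archive_command wrapper might look something like this.
(A sketch with made-up names, not a finished script; the key bit is
exiting successfully without copying anything when no backup is in
progress.)

```shell
#!/bin/sh
# Hypothetical archive_command wrapper: spool a WAL segment locally only
# while a base backup is in progress (backup_label present in $PGDATA).
# PostgreSQL would invoke it with %p (segment path) and %f (file name):
#   archive_command = '/usr/local/bin/spool_wal.sh %p %f'

spool_wal() {
    wal_path=$1    # %p: path of the segment, relative to the data dir
    wal_name=$2    # %f: bare segment file name

    # No backup running: report success so PostgreSQL recycles the file.
    [ -f "$PGDATA/backup_label" ] || return 0

    mkdir -p "$WAL_SPOOL"
    # Copy (never move) -- PostgreSQL still owns the file in pg_xlog --
    # and rename into place so readers never see a partial file.
    cp "$PGDATA/$wal_path" "$WAL_SPOOL/.$wal_name.tmp" &&
        mv "$WAL_SPOOL/.$wal_name.tmp" "$WAL_SPOOL/$wal_name"
}
```

Note that returning success for segments outside the backup window is
what keeps the server from stalling on an ever-growing pg_xlog.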

> > A configuration like this is described in the documentation (again,
> > section 24.3), though some assembly is required. I'm hoping to avoid
> > it since it introduces a second channel to the backup which I can
> > almost guarantee will be forgotten in time. (BTW, we tend to
> > consider the lifespan of our installations in terms of decades.)
> >
> > I mean, if it's necessary, then so be it of course, but the question
> > still remains: I've tested a fair bit under fairly adverse conditions
> > and not had a single failure. Luck?
>
> It looks so, and wal_keep_segments is set to a value high enough on
> the master side so that all the necessary WAL files are kept intact
> somewhere even if they are not archived.

I'm going to re-run my tests again, but here's what I did:

1. Set wal_keep_segments to a small value (I think I used 8)
2. Create a new database with a small canary table
3. pg_start_backup()
4. Run a bunch of database drops/loads and table drops/loads (with
   pg_restore and COPY ... FROM, respectively) for hours and hours
   (writing tens of GBs of data, without question cycling through
   wal_keep_segments)
5. Take a tarball of the cluster (including pg_xlog/)
6. pg_stop_backup()
7. Restore from tarball, start cluster
8. Check for canary table (successfully)
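In script form, the middle steps boil down to roughly the following.
(A sketch; "canarydb", the paths, and generate_write_load are
illustrative stand-ins, and wal_keep_segments is assumed already set low
in postgresql.conf.)

```shell
#!/bin/sh
# Sketch of the consistency test above; names and paths are made up.

run_consistency_test() {
    createdb canarydb
    psql -d canarydb -c "CREATE TABLE canary (id int)"

    psql -c "SELECT pg_start_backup('consistency test')"
    generate_write_load      # hours of drops/loads, cycling through WAL
    tar -czf /tmp/cluster.tar.gz -C "$PGDATA" .   # includes pg_xlog/
    psql -c "SELECT pg_stop_backup()"

    # Restore elsewhere, start the cluster, and look for the canary.
    mkdir -p "$RESTORE_DIR" && tar -xzf /tmp/cluster.tar.gz -C "$RESTORE_DIR"
    pg_ctl -D "$RESTORE_DIR" -w start
    psql -d canarydb -c "SELECT count(*) FROM canary"
}
```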

Separately, I also prototyped the backup, and I've run it dozens of
times, occasionally with a heavy write load on the source/master server,
without errors. Now, the write load was likely never enough to saturate
all allocatable WAL segments, but given the documentation (which says
that the _last_ WAL segment, the one synced by pg_stop_backup(), must be
copied as well), I'd have expected a consistency error or similar.

Since somebody has now said "no you're crazy" (paraphrased :), I'm going
to double-check again.

I didn't mention it in the original email, but of course one possibility
is that the documentation is simply incomplete: if wal_keep_segments is
set high enough that the earliest needed segment isn't overwritten while
the rsync is running, everything would be fine. But (again, if I'm
reading the documentation correctly) I think there should still be
consistency errors (unless I'm just lucky). Maybe if there's an
incomplete WAL transaction/segment/whatever, recovery just silently
ignores it. (That would of course make sense.)

I'll run the tests again and this time not use a canary table. It occurs
to me that what I might be seeing is data files in the cluster not being
deleted. If I'm dropping databases and/or tables, my base backup may be
getting the old ones but the new ones might be inconsistent/broken.
(Though I'd still expect a "cannot replay log" error of some sort at
cluster startup.)

--
Arguing with an engineer is like wrestling with a pig in mud.
After a while, you realise the pig is enjoying it.

OpenPGP v4 key ID: 4096R/59DDCB9F
Fingerprint: CC53 F124 35C0 7BC2 58FE 7A3C 157D DFD9 59DD CB9F
Retrieve from subkeys.pgp.net
