Re: backup manifests

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: David Steele <david(at)pgmasters(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Tels <nospam-pg-abuse(at)bloodgate(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: backup manifests
Date: 2020-03-31 18:10:34
Message-ID: CA+TgmoawEeE5qpFgj5Vy2zZGKzd3ZSEhGrD_JdPqPd2GB8u1Cw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 30, 2020 at 2:59 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think it wouldn't be too hard to compute that information while taking
> the base backup. We know the end timeline (ThisTimeLineID), so we can
> just call readTimeLineHistory(ThisTimeLineID). Which should then allow
> for something pretty trivial along the lines of
>
> timelines = readTimeLineHistory(ThisTimeLineID);
> last_start = InvalidXLogRecPtr;
> foreach(lc, timelines)
> {
> TimeLineHistoryEntry *he = lfirst(lc);
>
> if (he->end < startptr)
> continue;
>
> //
> manifest_emit_wal_range(Min(he->begin, startptr), he->end);
> last_start = he->end;
> }
>
> if (last_start == InvalidXlogRecPtr)
> start = startptr;
> else
> start = last_start;
>
> manifest_emit_wal_range(start, entptr);

I made an attempt to implement this. In the attached patch set, 0001
and 0002 are (I think) unmodified from the last version. 0003 is a
slightly-rejiggered version of your new pg_waldump option. 0004 whacks
0002 around so that the WAL ranges are included in the manifest and
pg_validatebackup tries to run pg_waldump for each WAL range. It
appears to work in light testing, but I haven't yet (1) tested it
extensively, (2) written good regression tests for it above and beyond
what pg_validatebackup had already, or (3) updated the documentation.
I'm going to work on those things. I would appreciate *very timely*
feedback on anything people do or do not like about this, because I
want to commit this patch set by the end of the work week and that
isn't very far away. I would also appreciate if people would bear in
mind the principle that half a loaf is better than none, and further
improvements can be made in future releases.

As part of my light testing, I tried promoting a standby that was
running pg_basebackup, and found that pg_basebackup failed like this:

pg_basebackup: error: could not get COPY data stream: ERROR: the
standby was promoted during online backup
HINT: This means that the backup being taken is corrupt and should
not be used. Try taking another online backup.
pg_basebackup: removing data directory "/Users/rhaas/pgslave2"

My first thought was that this error message is hard to reconcile with
this comment:

/*
* Send timeline history files too. Only the latest timeline history
* file is required for recovery, and even that only if there happens
* to be a timeline switch in the first WAL segment that contains the
* checkpoint record, or if we're taking a base backup from a standby
* server and the target timeline changes while the backup is taken.
* But they are small and highly useful for debugging purposes, so
* better include them all, always.
*/

But then it occurred to me that this might be a cascading standby.
Maybe the original master died and this machine's master got promoted,
so it has to follow a timeline switch but doesn't itself get promoted.
I think I might try to test out that scenario and see what happens,
but I haven't done so as of this writing. Regardless, it seems like a
really good idea to store a list of WAL ranges rather than a single
start/end/timeline, because even if it's impossible today it might
become possible in the future. Still, unless there's an easy way to
set up a test scenario where multiple WAL ranges need to be verified,
it may be hard to test that this code actually behaves properly.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v16-0001-Add-checksum-helper-functions.patch application/octet-stream 9.5 KB
v16-0002-Generate-backup-manifests-for-base-backups-and-v.patch application/octet-stream 117.8 KB
v16-0004-WIP-Store-WAL-ranges-in-manifest-and-validate-th.patch application/octet-stream 33.4 KB
v16-0003-pg_waldump-Add-quiet-option.patch application/octet-stream 2.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2020-03-31 18:18:31 Re: Ecpg dependency
Previous Message Justin Pryzby 2020-03-31 18:09:29 Re: Add A Glossary