Re: Missing important information in backup.sgml

From: "Gunnar \"Nick\" Bluth" <gunnar(dot)bluth(at)pro-open(dot)de>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: "pgsql-docs(at)postgresql(dot)org" <pgsql-docs(at)postgresql(dot)org>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: Missing important information in backup.sgml
Date: 2016-11-29 14:54:57
Message-ID: d1b49a36-6e90-d8c6-6242-66d26ae32b6f@pro-open.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-docs

Am 29.11.2016 um 14:43 schrieb Stephen Frost:
> Greetings,
>
> * Gunnar "Nick" Bluth (gunnar(dot)bluth(at)pro-open(dot)de) wrote:
>> Am 23.11.2016 um 21:41 schrieb Stephen Frost:
>>> * Gunnar "Nick" Bluth (gunnar(dot)bluth(at)pro-open(dot)de) wrote:
>>> One of the very important things that should be done as part of a backup
>>> is to ensure that all of the archive files required to restore the
>>> database to a consistent state are safely stored in the archive. If
>>> that isn't done then it's possible that an incomplete archive may also
>>> render backups invalid.
>>
>> Well, the need to have a complete archive is described in the docs
>> already. Maybe the potential consequences of an incomplete archive
>> should be pointed or more drastically...?
>
> We should probably add to "step 5" something about "verify that all WAL
> files have been archived between the start and stop backup" or similar.

+1

>> Now, the main purpose of my patch was to document a behaviour that many
>> of us have run into, namely that FATAL error showing up in the log when
>> the archive_command exits with RC > 127. It's a nuisance only, but it
>> does send people on false tracks and should at least be mentioned in the
>> documentation.
>
> I agree that we should add information to the documentation that certain
> error codes are handled differently.
>
>> And since a couple of people does use rsync (or some wrappers around it)
>> for archiving, and that is notoriously giving RCs > 127, it seems legit
>> to at least mention it, no?
>
> The low-level API documentation should be focused on the API and not how
> to (mis)use the API using common unix commands.
>
>> diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
>> index 6eaed1e..a8f574e 100644
>> --- a/doc/src/sgml/backup.sgml
>> +++ b/doc/src/sgml/backup.sgml
>> @@ -587,7 +587,8 @@ tar -cf backup.tar /usr/local/pgsql/data
>> the administrator specify a shell command to be executed to copy a
>> completed segment file to wherever it needs to go. The command could be
>> as simple as a <literal>cp</>, or it could invoke a complex shell
>> - script &mdash; it's all up to you.
>> + script &mdash; it's all up to you. There however are some things to consider
>> + when creating such a command, most of which are covered below.
>> </para>
>
> If we're changing this then we should remove the notion that it could be
> "as simple as a cp" because it really should *not* be. That is a
> dangerous and very poor recommendation to be making to our users.

That thought creeped on me as well... ;-)

>> <para>
>> @@ -636,7 +637,11 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 &amp;&amp; cp pg_wal/0
>> <productname>PostgreSQL</> will assume that the file has been
>> successfully archived, and will remove or recycle it. However, a nonzero
>> status tells <productname>PostgreSQL</> that the file was not archived;
>> - it will try again periodically until it succeeds.
>> + it will try again periodically until it succeeds. Note that an exit
>> + status of 128 or higher will cause the archiver to exit, resulting in a
>> + <literal>FATAL</> error in the server log. It will be restarted by the
>> + postmaster and continue where it left. E.g., <command>rsync</> is known
>> + for returning exit statuses of 255 on network issues.
>> </para>
>
> I agree with adding documentation about what happens with different exit
> status values. I don't believe we should make any mention of rsync. A
> sophisticated enough user to understand exit codes should be able to
> work out what the exit code is for whatever tool or tools they're using,
> it's not on us to document what the exit codes are for every possible
> tool (nor should one even use something as simple as a bare cp or rsync
> in archive_command).

Fair enough.

>
>> <para>
>> @@ -696,6 +701,16 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 &amp;&amp; cp pg_wal/0
>> preserve the file name (<literal>%f</>).
>> </para>
>>
>> + <para>
>> + Depending on your specific requirements (and datacenter layout), you may
>> + want to make sure that the archived WAL segments have been written out to
>> + persistent storage before the <command>archive_command</> returns.
>> + Otherwise, a WAL segment that is assumed by <productname>PostgreSQL</>
>> + to be archived could be recycled or removed prematurely, rendering your
>> + archive incomplete (and thus disabling a recovery) in case of an outage of
>> + the archiving destination.
>> + </para>
>
> I dislike the tone of this. The wording I would suggest would be more
> along the lines of:
>
> The archive_command should only return success once the WAL segment has
> been completely copied, written out, and synced to persistent storage as
> PostgreSQL will recycle or remove the segment shortly after the
> archive_command returns. If the WAL segment is lost due to an outage of
> the archive server or other issue, any backup which was performed during
> the time that the WAL segment was written will be unusable and
> Point-In-Time-Recovery from an earlier backup will not be usable past
> the missing WAL segment.

Sounds good to me. I'll incorparate that and send -v6 soon!

--
Gunnar "Nick" Bluth
RHCE/SCLA

Mobil +49 172 8853339
Email: gunnar(dot)bluth(at)pro-open(dot)de
_____________________________________________________________
In 1984 mainstream users were choosing VMS over UNIX.
Ten years later they are choosing Windows over UNIX.
What part of that message aren't you getting? - Tom Payne

In response to

Browse pgsql-docs by date

  From Date Subject
Next Message Ian Barwick 2016-12-01 01:37:08 monitoring.sgml - clarify length of query text displayed in pg_stat_statements
Previous Message Stephen Frost 2016-11-29 13:43:23 Re: Missing important information in backup.sgml