Re: Something else about Redo Logs disappearing

From: Peter <pmc(at)citylink(dot)dinoex(dot)sub(dot)org>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Stephen Frost <sfrost(at)snowman(dot)net>, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Something else about Redo Logs disappearing
Date: 2020-06-15 12:50:22
Message-ID: 20200615125022.GA21249@gate.oper.dinoex.org
Lists: pgsql-general

On Mon, Jun 15, 2020 at 11:44:33AM +0200, Laurenz Albe wrote:
! On Sat, 2020-06-13 at 19:48 +0200, Peter wrote:
! > ! > 4. If, by misconfiguration and/or operator error, the backup system
! > ! > happens to start a second backup in parallel to the first,
! > ! > then do I correctly assume, both backups will be rendered
! > ! > inconsistent while this may not be visible to the operator; and
! > ! > the earlier backup would be flagged as apparently successful while
! > ! > carrying the wrong (later) label?
! > !
! > ! If you are using my scripts and start a second backup while the first
! > ! one is still running, the first backup will be interrupted.
! >
! > This is not what I am asking. It appears correct to me, that, on
! > the database, the first backup will be interrupted. But on the
! > tape side, this might go unnoticed, and on completion it will
! > successfully receive the termination code from the *SECOND*
! > backup - which means that on tape we will have a seemingly
! > successful backup, which
! > 1. is corrupted, and
! > 2. carries a wrong label.
!
! That will only happen if the backup that uses my scripts does the
! wrong thing.

Yes. Occasionally software does the wrong thing; that is called a "bug".

! An example:
!
! - Backup #1 calls "pgpre.sh"
! - Backup #1 starts copying files
! - Backup #2 calls "pgpre.sh".
! This will cancel the first backup.
! - Backup #1 completes copying files.
! - Backup #1 calls "pgpost.sh".
! It will receive an error.
! So it has to invalidate the backup.
! - Backup #2 completes copying files.
! - Backup #2 calls "pgpost.sh".
! It gets a "backup_label" file and completes the backup.

That's not true.

Now let me see how to compile a bash... and here we go:

! An example:
!
! - Backup #1 calls "pgpre.sh"

> $ ./pgpre.sh
> backup starting location: 1/C8000058
> $

We now have:
> 24129 10 SJ 0:00.00 /usr/local/bin/bash ./pgpre.sh
> 24130 10 SJ 0:00.00 /usr/local/bin/bash ./pgpre.sh
> 24131 10 SJ 0:00.01 psql -Atq
> 24158 10 SCJ 0:00.00 sleep 5

And:
> postgres=# \d
> List of relations
> Schema | Name | Type | Owner
> --------+--------+-------+----------
> public | backup | table | postgres
> (1 row)
>
> postgres=# select * from backup;
> id | state | pid | backup_label | tablespace_map
> ----+---------+-------+--------------+----------------
> 1 | running | 24132 | |
> (1 row)

! - Backup #1 starts copying files

Let's suppose it does now.

! - Backup #2 calls "pgpre.sh".

> $ ./pgpre.sh
> backup starting location: 1/C9000024
> $ FATAL: terminating connection due to administrator command
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> connection to server was lost
> Backup failed
> ./pgpre.sh: line 93: ${PSQL[1]}: ambiguous redirect
>
> $ echo $?
> 0

! This will cancel the first backup.

Yes, it seems it did:

> 25279 10 SJ 0:00.00 /usr/local/bin/bash ./pgpre.sh
> 25280 10 IWJ 0:00.00 /usr/local/bin/bash ./pgpre.sh
> 25281 10 SJ 0:00.01 psql -Atq
> 25402 10 SCJ 0:00.00 sleep 5

> postgres=# \d
> List of relations
> Schema | Name | Type | Owner
> --------+--------+-------+----------
> public | backup | table | postgres
> (1 row)
>
> postgres=# select * from backup;
> id | state | pid | backup_label | tablespace_map
> ----+---------+-------+--------------+----------------
> 1 | running | 25282 | |
> (1 row)

! - Backup #1 completes copying files.
! - Backup #1 calls "pgpost.sh".

> $ ./pgpost.sh
> START WAL LOCATION: 1/C9000024 (file 0000000100000001000000C9)
> CHECKPOINT LOCATION: 1/C9000058
> BACKUP METHOD: streamed
> BACKUP FROM: master
> START TIME: 2020-06-15 14:09:41 CEST
> LABEL: 2020-06-15 14:09:40
> START TIMELINE: 1
>
> $ echo $?
> 0

! It will receive an error.
! So it has to invalidate the backup.

Where is the error?
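
The exit status is 0, but the output itself shows the mismatch: backup #1 was told its starting location was 1/C8000058, while the label it received starts at 1/C9000024. A minimal sketch of how a caller could compare the two, assuming it captures the text printed by both scripts; the function names are hypothetical and not part of pgpre.sh or pgpost.sh:

```python
import re


def start_location_from_pre(output: str) -> str:
    """Extract the location from the 'backup starting location:' line."""
    m = re.search(r"backup starting location:\s*(\S+)", output)
    if m is None:
        raise ValueError("no starting location in pgpre.sh output")
    return m.group(1)


def start_location_from_label(label: str) -> str:
    """Extract the location from the 'START WAL LOCATION:' line of a label."""
    m = re.search(r"^START WAL LOCATION:\s*(\S+)", label, re.MULTILINE)
    if m is None:
        raise ValueError("no START WAL LOCATION in backup label")
    return m.group(1)


def label_matches(pre_output: str, label: str) -> bool:
    """True if the label belongs to the backup that pgpre.sh announced."""
    return start_location_from_pre(pre_output) == start_location_from_label(label)


# Values copied from the transcripts in this message.
pre_1 = "backup starting location: 1/C8000058"
pre_2 = "backup starting location: 1/C9000024"
label = (
    "START WAL LOCATION: 1/C9000024 (file 0000000100000001000000C9)\n"
    "CHECKPOINT LOCATION: 1/C9000058\n"
)

print(label_matches(pre_1, label))  # False: backup #1 got a foreign label
print(label_matches(pre_2, label))  # True: the label belongs to backup #2
```

Such a check only works if the caller keeps the output of its own pgpre.sh call; the exit status alone gives it nothing to go on.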

What we now have is this: no processes are left, and the table reads:

> id | state | pid | backup_label | tablespace_map
> ----+----------+-------+----------------------------------------------------------------+----------------
> 1 | complete | 25282 | START WAL LOCATION: 1/C9000024 (file 0000000100000001000000C9)+|
> | | | CHECKPOINT LOCATION: 1/C9000058 +|
> | | | BACKUP METHOD: streamed +|
> | | | BACKUP FROM: master +|
> | | | START TIME: 2020-06-15 14:09:41 CEST +|
> | | | LABEL: 2020-06-15 14:09:40 +|
> | | | START TIMELINE: 1 +|
> | | | |
> (1 row)

! - Backup #2 completes copying files.
! - Backup #2 calls "pgpost.sh".
! It gets a "backup_label" file and completes the backup.

Wishful thinking.

BOTH backups are now inconsistent, and the first got the label from
the second, and appears to be intact. Exactly as I said before.
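
The sequence can be modelled in a few lines. This is only a toy sketch, assuming a single shared state row as in the "backup" table shown above; the class and its methods are hypothetical, not the code of the actual scripts, and the behaviour of the second post call is an assumption, since it was not run here:

```python
class BackupTable:
    """Toy stand-in for the one-row "backup" table (hypothetical model)."""

    def __init__(self):
        self.state = None
        self.pid = None
        self.start = None

    def pre(self, pid, start):
        # A later call simply replaces the running session; the exit status
        # is 0 either way, matching the "echo $?" output above.
        self.state, self.pid, self.start = "running", pid, start
        return 0

    def post(self):
        # Nothing identifies the caller, so the first caller gets the label
        # of whichever session is currently running.
        if self.state != "running":
            return 1, None  # assumption: a later call finds nothing to finish
        self.state = "complete"
        return 0, "START WAL LOCATION: " + self.start


table = BackupTable()
table.pre(24132, "1/C8000058")   # backup #1
table.pre(25282, "1/C9000024")   # backup #2, replaces backup #1
rc1, label1 = table.post()       # called by backup #1
rc2, label2 = table.post()       # called by backup #2

print(rc1, label1)  # 0 START WAL LOCATION: 1/C9000024
print(rc2, label2)  # 1 None
```

In this model backup #1 finishes with status 0 and the label of backup #2, because neither call carries anything that ties it to the session it started.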

I don't need to try such things out. I can do logical verification in
my mind, by looking at the code.

And on the same foundation I am saying that this whole new API is a
misconception.

cheerio,
PMc
