Re: 12.2: Why do my Redo Logs disappear??

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Peter <pmc(at)citylink(dot)dinoex(dot)sub(dot)org>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: 12.2: Why do my Redo Logs disappear??
Date: 2020-06-09 00:35:40
Message-ID: d105ee7a-19d5-3207-6311-a1dde37f95f5@aklaver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 6/8/20 5:02 PM, Peter wrote:
> Hi all,
> this is a 12.2 Release on FreeBSD 11.3.
>
> I am doing RedoLog Archiving according to Docs Chapter 25.1.

There is no ReDo logging, there is WAL logging.

What docs, because section 25.1 in the Postgres docs is :

https://www.postgresql.org/docs/12/backup-dump.html

25.1. SQL Dump

>
> During the last week I have lost 4 distinct Redo Logs; they are
> not in the backup.
>
> Loosing a RedoLog is very bad, because there is no redundancy,
> loosing a single one of them makes the timeline disappear and it
> will only reappear after another Base Backup. Very very bad.
>
> So I did analyze the matter.
>
> There are three ways to restart a Berkeley machine:
>
> 1. Cowboy it (aka pull-the-plug). This makes sure that everything is
> certainly dead immediately, and, given all hard- and software is
> well designed, nothing bad should happen.
>
> 2. Shut it down. This is the official means, and it takes very long,
> because each and every applications are signalled and given time to
> bring up whatever objections they may have.
>
> In this case "pg_ctl stop" will be invoked with whatever options the
> sysop has configured, and postgres will copy out a full log into
> archive before terminating.
>
> 3. Halt/Reboot it, like this:
> https://www.freebsd.org/cgi/man.cgi?query=reboot&sektion=8&manpath=FreeBSD+11.3-RELEASE
> This is considered more safe than pull-the-plug, and still fast.
> Applications are killed without much waiting, but all disk buffers
> are flushed to permanent storage and filesystems closed.
>
> In this case, it seems, Postgres will delete the current log
> without archiving it. :(
>
> What precisely happens (according to the OS sources) during reboot
> is this: processes will be sent SIGTERM, and after some 2-10
> seconds followed by SIGKILL.

https://www.postgresql.org/docs/12/server-shutdown.html
"
Important

It is best not to use SIGKILL to shut down the server. Doing so will
prevent the server from releasing shared memory and semaphores.
Furthermore, SIGKILL kills the postgres process without letting it relay
the signal to its subprocesses, so it might be necessary to kill the
individual subprocesses by hand as well.

To terminate an individual session while allowing other sessions to
continue, use pg_terminate_backend() (see Table 9.83) or send a SIGTERM
signal to the child process associated with the session."

What is RedoLog.bck?

>
> --------------------------------------
>
> Lets have a closer look:
>
> We did a regular shutdown at 17:09, and then we did a reboot at 19:24.
>
> Here is the content of the staging area (where the logs are
> copied to and accumulated until it is worth to run a backup job):
>
> # dir arch/
> total 240
> drwx------ 2 postgres postgres 5 Jun 8 17:09 .
> drwxr-xr-x 6 postgres postgres 7 Jun 8 17:09 ..
> -rw------- 1 postgres postgres 16777216 Jun 8 09:38 0000000100000017000000FC.ok
> -rw------- 1 postgres postgres 16777216 Jun 8 10:48 0000000100000017000000FD.ok
> -rw------- 1 postgres postgres 16777216 Jun 8 17:09 0000000100000017000000FE.ok
>
> And here is the pg_wal directory:
>
> # dir data12/pg_wal/
> total 89256
> drwx------ 3 postgres postgres 10 Jun 8 19:28 .
> drwx------ 19 postgres postgres 23 Jun 8 19:28 ..
> -rw------- 1 postgres postgres 335 Jun 7 07:36 0000000100000017000000EF.00000060.backup
> -rw------- 1 postgres postgres 16777216 Jun 8 19:38 000000010000001800000000
> -rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000001
> -rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000002
> -rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000003
> -rw------- 1 postgres postgres 16777216 Jun 7 07:17 000000010000001800000004
> -rw------- 1 postgres postgres 16777216 Jun 7 07:36 000000010000001800000005
> drwx------ 2 postgres postgres 3 Jun 8 17:09 archive_status
> # dir data12/pg_wal/archive_status
> total 23
> drwx------ 2 postgres postgres 3 Jun 8 17:09 .
> drwx------ 3 postgres postgres 10 Jun 8 19:28 ..
> -rw------- 1 postgres postgres 0 Jun 7 07:36 0000000100000017000000EF.00000060.backup.done
>
> Now where the hell is my "FF" log ???
>
> Lets check syslog - this was the full shutdown at 17:09:
>
> Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [10-1] :[] LOG: 00000: received fast shutdown request
> Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [10-2] :[] LOCATION: pmdie, postmaster.c:2780
> Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [11-1] :[] LOG: 00000: aborting any active transactions
> Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [11-2] :[] LOCATION: pmdie, postmaster.c:2813
> Jun 8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-1] :[] DEBUG: 00000: logical replication launcher shutting down
> Jun 8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-2] :[] LOCATION: ProcessInterrupts, postgres.c:2981
> Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [12-1] :[] LOG: 00000: background worker "logical replication launcher" (PID 73549) exited with exit code 1
> Jun 8 17:09:38 <local0.info> admn pg-bck[73534]: [12-2] :[] LOCATION: LogChildExit, postmaster.c:3657
> Jun 8 17:09:38 <local0.info> admn pg-bck[73544]: [13-1] :[] LOG: 00000: shutting down
> Jun 8 17:09:38 <local0.info> admn pg-bck[73544]: [13-2] :[] LOCATION: ShutdownXLOG, xlog.c:8321
> Jun 8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck invoked pg_wal/0000000100000017000000FE 0000000100000017000000FE
> Jun 8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck pg_wal/0000000100000017000000FE 0000000100000017000000FE returns 0
> Jun 8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-1] :[] DEBUG: 00000: archived write-ahead log file "0000000100000017000000FE"
> Jun 8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-2] :[] LOCATION: pgarch_archiveXlog, pgarch.c:675
> Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [14-1] :[] LOG: 00000: checkpoint starting: shutdown immediate
> Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [14-2] :[] LOCATION: LogCheckpointStart, xlog.c:8362
> Jun 8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-1] :[] DEBUG: 00000: performing replication slot checkpoint
> Jun 8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-2] :[] LOCATION: CheckPointReplicationSlots, slot.c:1078
> Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [16-1] :[] LOG: 00000: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=10.131 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB
> Jun 8 17:09:55 <local0.info> admn pg-bck[73544]: [16-2] :[] LOCATION: LogCheckpointEnd, xlog.c:8435
> Jun 8 17:09:55 <local0.info> admn pg-bck[73534]: [13-1] :[] LOG: 00000: database system is shut down
> Jun 8 17:09:55 <local0.info> admn pg-bck[73534]: [13-2] :[] LOCATION: UnlinkLockFiles, miscinit.c:859
>
> So, "FE" was copied out okay, and we have it in the staging area.
>
> And here is the relevant stuff from the following startup - here we
> consequentially are within the "FF" log:
>
> Jun 8 17:15:38 <local0.info> admn pg-bck[6366]: [8-1] :[] LOG: 00000: database system was shut down at 2020-06-08 17:09:55 CEST
> Jun 8 17:15:38 <local0.info> admn pg-bck[6366]: [8-2] :[] LOCATION: StartupXLOG, xlog.c:6242
> Jun 8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-1] :[] DEBUG: 00000: checkpoint record is at 17/FF000024
> Jun 8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-2] :[] LOCATION: StartupXLOG, xlog.c:6532
> Jun 8 17:15:38 <local0.debug> admn pg-bck[6366]: [10-1] :[] DEBUG: 00000: redo record is at 17/FF000024; shutdown true
> ----------------------------------------
>
>>From the fast reboot @ 19:24 noting at all is logged.
>
> At the following startup, we see that we are still within the "FF"
> log:
>
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-1] :[] DEBUG: 00000: registering background worker "logical replication launcher"
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-2] :[] LOCATION: RegisterBackgroundWorker, bgworker.c:854
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [2-1] :[] LOG: 00000: starting PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [2-2] :[] LOCATION: PostmasterMain, postmaster.c:997
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [3-1] :[] LOG: 00000: listening on IPv4 address "0.0.0.0", port 5433
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [3-2] :[] LOCATION: StreamServerPort, pqcomm.c:590
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [4-1] :[] LOG: 00000: listening on Unix socket "/tmp/.s.PGSQL.5433"
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [4-2] :[] LOCATION: StreamServerPort, pqcomm.c:585
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [5-1] :[] LOG: 00000: ending log output to stderr
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [5-2] :[] HINT: Future log output will go to log destination "syslog".
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [5-3] :[] LOCATION: PostmasterMain, postmaster.c:1297
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [6-1] :[] LOG: XX000: could not send test message on socket for statistics collector: Permission denied
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [6-2] :[] LOCATION: pgstat_init, pgstat.c:486
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [7-1] :[] LOG: 00000: trying another address for the statistics collector
> Jun 8 19:28:24 <local0.info> admn pg-bck[6465]: [7-2] :[] LOCATION: pgstat_init, pgstat.c:418
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [8-1] :[] LOG: 00000: database system was interrupted; last known up at 2020-06-08 17:25:38 CEST
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [8-2] :[] LOCATION: StartupXLOG, xlog.c:6267
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-1] :[] DEBUG: 00000: checkpoint record is at 17/FF01BFE8
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-2] :[] LOCATION: StartupXLOG, xlog.c:6532
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-1] :[] DEBUG: 00000: redo record is at 17/FF01BFB4; shutdown false
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-2] :[] LOCATION: StartupXLOG, xlog.c:6609
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-1] :[] DEBUG: 00000: next transaction ID: 18955154; next OID: 145913
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-2] :[] LOCATION: StartupXLOG, xlog.c:6613
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-1] :[] DEBUG: 00000: next MultiXactId: 1; next MultiXactOffset: 0
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-2] :[] LOCATION: StartupXLOG, xlog.c:6617
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-1] :[] DEBUG: 00000: oldest unfrozen transaction ID: 479, in database 13777
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-2] :[] LOCATION: StartupXLOG, xlog.c:6620
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-1] :[] DEBUG: 00000: oldest MultiXactId: 1, in database 1
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-2] :[] LOCATION: StartupXLOG, xlog.c:6623
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-1] :[] DEBUG: 00000: commit timestamp Xid oldest/newest: 0/0
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-2] :[] LOCATION: StartupXLOG, xlog.c:6626
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-1] :[] DEBUG: 00000: transaction ID wrap limit is 2147484126, limited by database with OID 13777
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-2] :[] LOCATION: SetTransactionIdLimit, varsup.c:408
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-1] :[] DEBUG: 00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-2] :[] LOCATION: SetMultiXactIdLimit, multixact.c:2267
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-1] :[] DEBUG: 00000: starting up replication slots
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-2] :[] LOCATION: StartupReplicationSlots, slot.c:1114
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [19-1] :[] LOG: 00000: database system was not properly shut down; automatic recovery in progress
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [19-2] :[] LOCATION: StartupXLOG, xlog.c:6764
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-1] :[] DEBUG: 00000: resetting unlogged relations: cleanup 1 init 0
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-2] :[] LOCATION: ResetUnloggedRelations, reinit.c:55
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [21-1] :[] LOG: 00000: redo starts at 17/FF01BFB4
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [21-2] :[] LOCATION: StartupXLOG, xlog.c:7035
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [22-1] :[] LOG: 00000: redo done at 17/FF01C098
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [22-2] :[] LOCATION: StartupXLOG, xlog.c:7297
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-1] :[] DEBUG: 00000: resetting unlogged relations: cleanup 0 init 1
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-2] :[] LOCATION: ResetUnloggedRelations, reinit.c:55
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [24-1] :[] LOG: 00000: checkpoint starting: end-of-recovery immediate
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [24-2] :[] LOCATION: LogCheckpointStart, xlog.c:8362
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-1] :[] DEBUG: 00000: performing replication slot checkpoint
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-2] :[] LOCATION: CheckPointReplicationSlots, slot.c:1078
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [26-1] :[] LOG: 00000: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16272 kB, estimate=16272 kB
> Jun 8 19:28:24 <local0.info> admn pg-bck[6467]: [26-2] :[] LOCATION: LogCheckpointEnd, xlog.c:8435
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-1] :[] DEBUG: 00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-2] :[] LOCATION: SetMultiXactIdLimit, multixact.c:2267
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-1] :[] DEBUG: 00000: MultiXact member stop limit is now 4294914944 based on MultiXact 1
> Jun 8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-2] :[] LOCATION: SetOffsetVacuumLimit, multixact.c:2630
>
> There is no further information about anything concerning the logs,
> but at this point in time "FF" has disappeared. It was NOT copied out,
> we can see the timestamp on the pg_wal/archive_status still being at
> 17:09.
>
> Nothing except postgres is supposed to write/delete anything within
> the data tree, and the RedoLog.bck script doesn't delete anything at all.
>
> Another cute question would be: The "FE" log was successfully copied
> out at 17:09:45. The checkpoint at 17:09:55 then says "1 removed" -
> this should be the "FE", because the one before was written seven
> hours earlier (see the "ls" above) and should by long be gone.
>
> But then the checkpoint at 19:28:24 again says "1 removed". What was
> removed there? It is unlikely to be the "FE", and the "FF" would be
> currently in use - and the archive_status directory was not written
> since 17:09. But the "FF" has disappeared. So what is going on here?
>
> cheerio,
> PMc
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Peter Geoghegan 2020-06-09 00:40:20 Re: 12.2: Why do my Redo Logs disappear??
Previous Message David G. Johnston 2020-06-09 00:16:44 Re: Is it possible to use keywords (date units) in a function definition?