Re: Strange times in WAL files in archive directory (9.3)

From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: Strange times in WAL files in archive directory (9.3)
Date: 2017-01-24 11:10:01
Message-ID: 58873609.4030004@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

On 24/01/2017 11:12, Achilleas Mantzios wrote:
> Hello,
> we are running PostgreSQL 9.3 on a new VM running Linux 3.16.0-4-amd64 #1 SMP, Debian 8.6, and we setup WAL archiving yesterday. We were planing to keep a window of 15 days worth of WAL files, so we
> thought of writing a script to delete files older than 15 days. The settings used were rather common :
> archive_mode = on
> archive_command = '/usr/bin/scp %p sma:/smadb/pgsql/pitr/%f'
> archive_timeout=240
> However, this morning, to my surprise I saw this by doing ls on the archive directory :
> # ls -l /smadb/pgsql/pitr/ | less
> total 12680296
> -rw------- 1 postgres postgres 16777216 Jan 24 03:44 000000020000070100000000
> -rw------- 1 postgres postgres 16777216 Jan 24 03:50 000000020000070100000001
> -rw------- 1 postgres postgres 16777216 Jan 24 03:54 000000020000070100000002
> -rw------- 1 postgres postgres 16777216 Jan 24 04:00 000000020000070100000003
> -rw------- 1 postgres postgres 16777216 Jan 24 04:00 000000020000070100000004
> -rw------- 1 postgres postgres 16777216 Jan 24 04:04 000000020000070100000005
> -rw------- 1 postgres postgres 16777216 Jan 24 04:09 000000020000070100000006
> -rw------- 1 postgres postgres 16777216 Jan 24 04:14 000000020000070100000007
> -rw------- 1 postgres postgres 16777216 Jan 24 04:20 000000020000070100000008
> -rw------- 1 postgres postgres 16777216 Jan 24 04:20 000000020000070100000009
> -rw------- 1 postgres postgres 16777216 Jan 24 04:25 00000002000007010000000A
> -rw------- 1 postgres postgres 16777216 Jan 24 04:29 00000002000007010000000B
> -rw------- 1 postgres postgres 16777216 Jan 24 04:34 00000002000007010000000C
> -rw------- 1 postgres postgres 16777216 Jan 24 04:39 00000002000007010000000D
> -rw------- 1 postgres postgres 16777216 Jan 24 04:39 00000002000007010000000E
> -rw------- 1 postgres postgres 16777216 Jan 24 04:44 00000002000007010000000F
> ...........
> -rw------- 1 postgres postgres 16777216 Jan 24 09:48 0000000200000701000000EF
> -rw------- 1 postgres postgres 16777216 Jan 24 09:48 0000000200000701000000F0
> -rw------- 1 postgres postgres 16777216 Jan 24 09:48 0000000200000701000000F1 <------
> -rw------- 1 postgres postgres 16777216 Jan 23 13:04 0000000200000701000000F2 <------ time jumps
> -rw------- 1 postgres postgres 16777216 Jan 23 13:05 0000000200000701000000F3
> -rw------- 1 postgres postgres 16777216 Jan 23 13:06 0000000200000701000000F4
>
> So IMHO either the naming of the WAL filename got damaged somehow or the one of the servers got wrong time or something I miss. "Jan 23 13:04" is indeed the time that we started the server for WAL
> archiving.
> By sorting by time I get :
> ls -ltr /smadb/pgsql/pitr/ | less
> -rw------- 1 postgres postgres 16777216 Jan 23 13:04 0000000200000701000000F2
> -rw------- 1 postgres postgres 16777216 Jan 23 13:05 0000000200000701000000F3
> -rw------- 1 postgres postgres 16777216 Jan 23 13:06 0000000200000701000000F4
> -rw------- 1 postgres postgres 16777216 Jan 23 13:07 0000000200000701000000F5
> -rw------- 1 postgres postgres 16777216 Jan 23 13:08 0000000200000701000000F6
> -rw------- 1 postgres postgres 16777216 Jan 23 13:09 0000000200000701000000F7
> -rw------- 1 postgres postgres 16777216 Jan 23 13:09 0000000200000701000000F8
> -rw------- 1 postgres postgres 16777216 Jan 23 13:10 0000000200000701000000F9
> -rw------- 1 postgres postgres 16777216 Jan 23 13:11 0000000200000701000000FA
> -rw------- 1 postgres postgres 16777216 Jan 23 13:12 0000000200000701000000FB
> -rw------- 1 postgres postgres 16777216 Jan 23 13:13 0000000200000701000000FC
> ........
> -rw------- 1 postgres postgres 16777216 Jan 24 03:15 0000000200000702000000F9
> -rw------- 1 postgres postgres 16777216 Jan 24 03:19 0000000200000702000000FA
> -rw------- 1 postgres postgres 16777216 Jan 24 03:23 0000000200000702000000FB
> -rw------- 1 postgres postgres 16777216 Jan 24 03:27 0000000200000702000000FC
> -rw------- 1 postgres postgres 16777216 Jan 24 03:31 0000000200000702000000FD
> -rw------- 1 postgres postgres 16777216 Jan 24 03:35 0000000200000702000000FE
> -rw------- 1 postgres postgres 16777216 Jan 24 03:39 0000000200000702000000FF
> -rw------- 1 postgres postgres 16777216 Jan 24 03:43 000000020000070300000000
> -rw------- 1 postgres postgres 16777216 Jan 24 03:44 000000020000070100000000 <----- problem begins
> -rw------- 1 postgres postgres 16777216 Jan 24 03:47 000000020000070300000001
> -rw------- 1 postgres postgres 16777216 Jan 24 03:50 000000020000070100000001
> -rw------- 1 postgres postgres 16777216 Jan 24 03:51 000000020000070300000002
> -rw------- 1 postgres postgres 16777216 Jan 24 03:54 000000020000070100000002
> ........ (same pattern goes on 0000000200000703 mixing with 0000000200000701 )
> -rw------- 1 postgres postgres 16777216 Jan 24 09:42 0000000200000701000000EE
> -rw------- 1 postgres postgres 16777216 Jan 24 09:45 0000000200000703000000EF
> -rw------- 1 postgres postgres 16777216 Jan 24 09:47 0000000200000703000000F0
> -rw------- 1 postgres postgres 16777216 Jan 24 09:48 0000000200000703000000F1
> -rw------- 1 postgres postgres 16777216 Jan 24 09:48 0000000200000701000000EF
> -rw------- 1 postgres postgres 16777216 Jan 24 09:48 0000000200000701000000F0
> -rw------- 1 postgres postgres 16777216 Jan 24 09:48 0000000200000701000000F1 <----- problem ends
> -rw------- 1 postgres postgres 16777216 Jan 24 09:50 0000000200000703000000F2
> -rw------- 1 postgres postgres 16777216 Jan 24 09:52 0000000200000703000000F3
> -rw------- 1 postgres postgres 16777216 Jan 24 09:56 0000000200000703000000F4
>
> Any thoughts about any explanation on this?
>
By scrolling up the terminal on the database server (cause the said file is gone now from pg_xlog) I saw this :
$ ls -ltr ~/data/pg_xlog/0000000200000701000000F0
-rw------- 1 postgres postgres 16777216 Jan 23 13:01 /var/lib/pgsql/data/pg_xlog/0000000200000701000000F0
$ md5sum ~/data/pg_xlog/0000000200000701000000F0
8ce1917c86d3c9e41455ec13659c2689 /var/lib/pgsql/data/pg_xlog/0000000200000701000000F0

In the archive's host I see this :
# ls -ltr /smadb/pgsql/pitr/0000000200000701000000F0
-rw------- 1 postgres postgres 16777216 Jan 24 09:48 /smadb/pgsql/pitr/0000000200000701000000F0
# md5sum /smadb/pgsql/pitr/0000000200000701000000F0
8ce1917c86d3c9e41455ec13659c2689 /smadb/pgsql/pitr/0000000200000701000000F0

Same file, different mod times, so, this makes me think that maybe there was some flushing that the postgresql server did, prior to deleting or recycling some old WAL files. Also I see this :
# ls -ltr /smadb/pgsql/pitr/0000000200000701000000*
-rw------- 1 postgres postgres 16777216 Jan 23 13:04 /smadb/pgsql/pitr/0000000200000701000000F2
-rw------- 1 postgres postgres 16777216 Jan 23 13:05 /smadb/pgsql/pitr/0000000200000701000000F3
-rw------- 1 postgres postgres 16777216 Jan 23 13:06 /smadb/pgsql/pitr/0000000200000701000000F4
-rw------- 1 postgres postgres 16777216 Jan 23 13:07 /smadb/pgsql/pitr/0000000200000701000000F5
-rw------- 1 postgres postgres 16777216 Jan 23 13:08 /smadb/pgsql/pitr/0000000200000701000000F6
-rw------- 1 postgres postgres 16777216 Jan 23 13:09 /smadb/pgsql/pitr/0000000200000701000000F7
-rw------- 1 postgres postgres 16777216 Jan 23 13:09 /smadb/pgsql/pitr/0000000200000701000000F8
-rw------- 1 postgres postgres 16777216 Jan 23 13:10 /smadb/pgsql/pitr/0000000200000701000000F9
-rw------- 1 postgres postgres 16777216 Jan 23 13:11 /smadb/pgsql/pitr/0000000200000701000000FA
-rw------- 1 postgres postgres 16777216 Jan 23 13:12 /smadb/pgsql/pitr/0000000200000701000000FB
-rw------- 1 postgres postgres 16777216 Jan 23 13:13 /smadb/pgsql/pitr/0000000200000701000000FC
-rw------- 1 postgres postgres 16777216 Jan 23 13:14 /smadb/pgsql/pitr/0000000200000701000000FD
-rw------- 1 postgres postgres 16777216 Jan 23 13:15 /smadb/pgsql/pitr/0000000200000701000000FE
-rw------- 1 postgres postgres 16777216 Jan 23 13:16 /smadb/pgsql/pitr/0000000200000701000000FF
-rw------- 1 postgres postgres 16777216 Jan 24 03:44 /smadb/pgsql/pitr/000000020000070100000000 <---- time jump
-rw------- 1 postgres postgres 16777216 Jan 24 03:50 /smadb/pgsql/pitr/000000020000070100000001
-rw------- 1 postgres postgres 16777216 Jan 24 03:54 /smadb/pgsql/pitr/000000020000070100000002
-rw------- 1 postgres postgres 16777216 Jan 24 04:00 /smadb/pgsql/pitr/000000020000070100000003
-rw------- 1 postgres postgres 16777216 Jan 24 04:00 /smadb/pgsql/pitr/000000020000070100000004
......
-rw------- 1 postgres postgres 16777216 Jan 24 09:42 /smadb/pgsql/pitr/0000000200000701000000EE
-rw------- 1 postgres postgres 16777216 Jan 24 09:48 /smadb/pgsql/pitr/0000000200000701000000EF
-rw------- 1 postgres postgres 16777216 Jan 24 09:48 /smadb/pgsql/pitr/0000000200000701000000F0
-rw------- 1 postgres postgres 16777216 Jan 24 09:48 /smadb/pgsql/pitr/0000000200000701000000F1

So this makes me believe that all the files before 0000000200000701000000F2 were present before WAL archiving was setup. So I want to ask :

1) Do you find this behavior normal?
2) From now on, can we assume a monotonic relation between file names and creation times in order to base our maintenance scripts on this?

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Tom Lane 2017-01-24 13:29:10 Re: Strange times in WAL files in archive directory (9.3)
Previous Message Achilleas Mantzios 2017-01-24 09:12:06 Strange times in WAL files in archive directory (9.3)