SR Slave leaves off every 32nd wal segment ?!

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: SR Slave leaves off every 32nd wal segment ?!
Date: 2013-05-21 20:17:44
Message-ID: 20130521201744.GA13666@depesz.com
Lists: pgsql-general

Hi,
We have the following situation:
Pg master on US east coast.
Pg slave on US west coast.

Replication set using omnipitr + streaming replication.

Setup on slave:
postgres=# select name, setting from pg_settings where name ~ '(checkpo|wal)';
name | setting
------------------------------+-------------
checkpoint_completion_target | 0.9
checkpoint_segments | 30
checkpoint_timeout | 300
checkpoint_warning | 30
log_checkpoints | on
max_wal_senders | 3
wal_block_size | 8192
wal_buffers | 2048
wal_keep_segments | 0
wal_level | hot_standby
wal_receiver_status_interval | 10
wal_segment_size | 2048
wal_sync_method | fdatasync
wal_writer_delay | 200
(14 rows)
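
For scale, with checkpoint_segments = 30 and wal_keep_segments = 0 I would not
expect pg_xlog to grow very large. The documented rule of thumb for a primary is
at most about (2 + checkpoint_completion_target) * checkpoint_segments + 1
segments; restartpoints on a standby behave a bit differently, so treat this as
a ballpark only:

$ echo $(( 2 * 30 + 9 * 30 / 10 + 1 ))   # (2 + 0.9) * 30 + 1
88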

Replication usually works well:
$ grep -c 'FATAL:..could not receive data from WAL stream:' postgresql-2013-05-*
postgresql-2013-05-14.log:7
postgresql-2013-05-15.log:7
postgresql-2013-05-16.log:9
postgresql-2013-05-17.log:9
postgresql-2013-05-18.log:8
postgresql-2013-05-19.log:8
postgresql-2013-05-20.log:7
postgresql-2013-05-21.log:7

After each such break, Pg switches to omnipitr, which recovers some WAL files, and then switches back to SR. All looks fine.
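
The archive-recovery part is easy to spot in the logs as well, since the startup
process logs a "restored log file ... from archive" line for every segment that
restore_command pulls in; counting those per day works the same way as the grep
above:

$ grep -c 'restored log file .* from archive' postgresql-2013-05-*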

But in pg_xlog on the slave we have lots of files, almost 500 WAL segments now.

What's weird is their numbering. The full list of ~500 files is too big to paste, so here are just the last 50:

pg_xlog$ ls -l | tail -n 50
-rw------- 1 postgres postgres 16777216 May 16 22:41 000000010000008B00000009
-rw------- 1 postgres postgres 16777216 May 17 01:16 000000010000008B00000029
-rw------- 1 postgres postgres 16777216 May 17 03:56 000000010000008B00000049
-rw------- 1 postgres postgres 16777216 May 17 06:36 000000010000008B00000069
-rw------- 1 postgres postgres 16777216 May 17 09:16 000000010000008B00000089
-rw------- 1 postgres postgres 16777216 May 17 11:56 000000010000008B000000A9
-rw------- 1 postgres postgres 16777216 May 17 14:36 000000010000008B000000C9
-rw------- 1 postgres postgres 16777216 May 17 17:16 000000010000008B000000E9
-rw------- 1 postgres postgres 16777216 May 17 19:56 000000010000008C0000000A
-rw------- 1 postgres postgres 16777216 May 17 22:36 000000010000008C0000002A
-rw------- 1 postgres postgres 16777216 May 18 01:12 000000010000008C0000004A
-rw------- 1 postgres postgres 16777216 May 18 03:52 000000010000008C0000006A
-rw------- 1 postgres postgres 16777216 May 18 06:32 000000010000008C0000008A
-rw------- 1 postgres postgres 16777216 May 18 09:13 000000010000008C000000AA
-rw------- 1 postgres postgres 16777216 May 18 14:33 000000010000008C000000EA
-rw------- 1 postgres postgres 16777216 May 18 17:13 000000010000008D0000000B
-rw------- 1 postgres postgres 16777216 May 18 19:53 000000010000008D0000002B
-rw------- 1 postgres postgres 16777216 May 18 22:33 000000010000008D0000004B
-rw------- 1 postgres postgres 16777216 May 19 01:05 000000010000008D0000006B
-rw------- 1 postgres postgres 16777216 May 19 03:45 000000010000008D0000008B
-rw------- 1 postgres postgres 16777216 May 19 06:25 000000010000008D000000AB
-rw------- 1 postgres postgres 16777216 May 19 09:05 000000010000008D000000CB
-rw------- 1 postgres postgres 16777216 May 19 11:45 000000010000008D000000EB
-rw------- 1 postgres postgres 16777216 May 19 14:25 000000010000008E0000000C
-rw------- 1 postgres postgres 16777216 May 19 17:05 000000010000008E0000002C
-rw------- 1 postgres postgres 16777216 May 19 19:45 000000010000008E0000004C
-rw------- 1 postgres postgres 16777216 May 19 22:25 000000010000008E0000006C
-rw------- 1 postgres postgres 16777216 May 20 01:01 000000010000008E0000008C
-rw------- 1 postgres postgres 16777216 May 20 03:41 000000010000008E000000AC
-rw------- 1 postgres postgres 16777216 May 20 06:21 000000010000008E000000CC
-rw------- 1 postgres postgres 16777216 May 20 09:01 000000010000008E000000EC
-rw------- 1 postgres postgres 16777216 May 20 11:41 000000010000008F0000000D
-rw------- 1 postgres postgres 16777216 May 20 14:21 000000010000008F0000002D
-rw------- 1 postgres postgres 16777216 May 20 17:01 000000010000008F0000004D
-rw------- 1 postgres postgres 16777216 May 20 19:41 000000010000008F0000006D
-rw------- 1 postgres postgres 16777216 May 20 22:21 000000010000008F0000008D
-rw------- 1 postgres postgres 16777216 May 21 01:00 000000010000008F000000AD
-rw------- 1 postgres postgres 16777216 May 21 03:35 000000010000008F000000CD
-rw------- 1 postgres postgres 16777216 May 21 06:15 000000010000008F000000ED
-rw------- 1 postgres postgres 16777216 May 21 08:55 00000001000000900000000E
-rw------- 1 postgres postgres 16777216 May 21 11:35 00000001000000900000002E
-rw------- 1 postgres postgres 16777216 May 21 14:15 00000001000000900000004E
-rw------- 1 postgres postgres 16777216 May 21 16:55 00000001000000900000006E
-rw------- 1 postgres postgres 16777216 May 21 19:35 00000001000000900000008E
-rw------- 1 postgres postgres 16777216 May 21 20:00 000000010000009000000093
-rw------- 1 postgres postgres 16777216 May 21 20:05 000000010000009000000094
-rw------- 1 postgres postgres 16777216 May 21 20:10 000000010000009000000095
-rw------- 1 postgres postgres 16777216 May 21 20:14 000000010000009000000096
-rw------- 1 postgres postgres 16777216 May 21 19:55 000000010000009000000097
drwx------ 2 postgres postgres 28672 May 21 20:10 archive_status

Aside from a couple of the newest ones, every 32nd WAL segment is kept for some reason.
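
That "every 32nd" falls straight out of the hex in the names: the segment part of
consecutive kept files differs by 0x20 (e.g. ...8B00000009 -> ...8B00000029 is 32
segments), and the spacing holds across log boundaries too, since pre-9.3 each
log holds 0xFF segments (...8B000000E9 -> ...8C0000000A is also 32). A quick
sketch that computes the gaps, assuming GNU awk for strtonum:

$ ls pg_xlog | grep -E '^[0-9A-F]{24}$' | sort | gawk '{
    # name layout: timeline (chars 1-8), log (9-16), segment (17-24), all hex;
    # 9.2 has 0xFF (255) segments per log, so flatten to one running number
    seg = strtonum("0x" substr($0, 9, 8)) * 255 + strtonum("0x" substr($0, 17, 8))
    if (NR > 1) print prev_name " -> " $0 ": " seg - prev " apart"
    prev = seg; prev_name = $0
  }'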

Pg is 9.2.4.

All these "older" (that is aside from 5 newest) have *.ready 0-byte files in archive_status.

Archiving on the slave is just /bin/true:

postgres=# select name, setting from pg_settings where name ~ '(archive)';
name | setting
---------------------------+-----------
archive_command | /bin/true
archive_mode | on
archive_timeout | 3600
max_standby_archive_delay | 30
(4 rows)

I am at a loss; I have no idea what could be wrong.
I mean, sure, I can remove all those old files and we'll be fine, but new ones
will most likely accumulate, and removing them by hand or from a cronjob is
kind of pointless.

Is there anything I could check that could lead to finding the cause of this weird behavior?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
