Re: pg_standby stuck on a wal file size <16MB

From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Vladimir Kosilov" <vkosilov(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_standby stuck on a wal file size <16MB
Date: 2008-05-17 19:07:07
Message-ID: 65937bea0805171207i3a2814bfl56c128ef83ed123a@mail.gmail.com
Lists: pgsql-general

On Sun, May 18, 2008 at 12:17 AM, Vladimir Kosilov <vkosilov(at)gmail(dot)com>
wrote:

> I've been having a problem with a pgsql-8.2.5 master/slave warm standby
> replication setup where the master node occasionally generates a WAL file
> smaller than the expected 16MB. pg_standby on the slave gets stuck on such
> short files, and replication halts from that moment on. We have to do
> pg_start_backup / rsync of data to slave / pg_stop_backup and restart the
> slave in order to recover. The database I'm replicating is write-mostly.
>
> This is the process on the slave that is waiting on the log file which is
> smaller in size:
> postgres 14277 11857 0 03:34 ? 00:00:00 sh -c
> /usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t
> /usr/local/recovery_control/quit_recovery /usr/local/wal_archive
> 000000010000034000000020 pg_xlog/RECOVERYXLOG 2>>
> /usr/local/pgsql/data/standby.log
> postgres 14278 14277 0 03:34 ? 00:00:00
> /usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t
> /usr/local/recovery_control/quit_recovery /usr/local/wal_archive
> 000000010000034000000020 pg_xlog/RECOVERYXLOG
>
> Here is a sample listing of the wal_archive directory on the slave. Note that
> the size of 000000010000034000000020, the expected WAL file, is less than 16MB:
>
> ...
> -rw------- 1 postgres postgres 16777216 May 17 03:19
> 000000010000034000000017
> -rw------- 1 postgres postgres 16777216 May 17 03:21
> 000000010000034000000018
> -rw------- 1 postgres postgres 16777216 May 17 03:22
> 000000010000034000000019
> -rw------- 1 postgres postgres 16777216 May 17 03:24
> 00000001000003400000001A
> -rw------- 1 postgres postgres 16777216 May 17 03:26
> 00000001000003400000001B
> -rw------- 1 postgres postgres 16777216 May 17 03:27
> 00000001000003400000001C
> -rw------- 1 postgres postgres 16777216 May 17 03:29
> 00000001000003400000001D
> -rw------- 1 postgres postgres 16777216 May 17 03:30
> 00000001000003400000001E
> -rw------- 1 postgres postgres 16777216 May 17 03:32
> 00000001000003400000001F
>
> -rw------- 1 postgres postgres 13746176 May 17 03:34
> 000000010000034000000020
>
> -rw------- 1 postgres postgres 16777216 May 17 03:35
> 000000010000034000000021
> -rw------- 1 postgres postgres 16777216 May 17 03:37
> 000000010000034000000022
> -rw------- 1 postgres postgres 16777216 May 17 03:38
> 000000010000034000000023
> -rw------- 1 postgres postgres 16777216 May 17 03:40
> 000000010000034000000024
> -rw------- 1 postgres postgres 16777216 May 17 03:41
> 000000010000034000000025
> -rw------- 1 postgres postgres 16777216 May 17 03:43
> 000000010000034000000026
> -rw------- 1 postgres postgres 16777216 May 17 03:45
> 000000010000034000000027
> ...
>
> Skipping a bit further, I see there is at least one other instance where a
> WAL file is shorter than the normal 16MB:
>
> -rw------- 1 postgres postgres 16777216 May 17 05:42
> 00000001000003400000006F
> -rw------- 1 postgres postgres 16777216 May 17 05:44
> 000000010000034000000070
> -rw------- 1 postgres postgres 16777216 May 17 05:46
> 000000010000034000000071
> -rw------- 1 postgres postgres 16777216 May 17 05:47
> 000000010000034000000072
>
> -rw------- 1 postgres postgres 16384 May 17 05:50
> 000000010000034000000073
>
> -rw------- 1 postgres postgres 16777216 May 17 05:51
> 000000010000034000000074
> -rw------- 1 postgres postgres 16777216 May 17 05:52
> 000000010000034000000075
> -rw------- 1 postgres postgres 16777216 May 17 05:54
> 000000010000034000000076
>
> Why would a master node create a WAL file smaller than the normal
> checkpoint_segment size, and how can this be avoided? I need a reliable
> replication mechanism, even at the cost of longer recovery on the standby.
>
> master's postgresql.conf:
> ..
> fsync = on
> wal_sync_method = open_sync
> wal_buffers = 128
> checkpoint_segments = 64
> archive_command = 'test ! -f /usr/local/wal_archive_local/%f && cp %p
> /usr/local/wal_archive_local/%f'
>
> Archive files are then moved from the master to the standby every other minute:
>
> rsync -aq --remove-sent-files /usr/local/wal_archive_local/
> slave::wal_archive/
>
> slave's recovery.conf:
> restore_command = '/usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w
> 0 -t /usr/local/recovery_control/quit_recovery /usr/local/wal_archive %f %p
> 2>> /usr/local/pgsql/data/standby.log'
>
> both servers are identical Dell PE1950 servers with 4 sas hd w/hardware
> RAID 1+0 running:
> 2.6.18-8.1.8.el5 #1 SMP Tue Jul 10 06:39:17 EDT 2007 x86_64 x86_64 x86_64
> GNU/Linux
>
> Thanks much in advance
> V.

Is there any reason why you are not using rsync directly in the
archive_command in the first place?
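
For illustration, that suggestion might look something like the following in
the master's postgresql.conf. This is only a sketch, not a tested
configuration: it assumes the rsync daemon module slave::wal_archive from the
quoted setup is reachable and writable from the master, and it drops the
"test ! -f" overwrite guard used in the original archive_command.

```
# Assumption: ship each completed segment straight to the standby's archive,
# replacing the local cp plus the separate rsync job run every other minute.
archive_command = 'rsync -a %p slave::wal_archive/%f'
```

By default rsync writes to a temporary file on the receiving side and renames
it into place only once the transfer has completed, so pg_standby should not
see a partially written segment under its final name. If rsync exits with a
non-zero status, PostgreSQL keeps the segment and retries the archive_command
later.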

gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device
