Quick Links

Replaying 48 WAL files takes 80 minutes

From:	"Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To:	<pgsql-performance(at)postgresql(dot)org>
Subject:	Replaying 48 WAL files takes 80 minutes
Date:	2012-10-29 13:05:24
Message-ID:	D960CB61B694CF459DCFB4B0128514C2089A5F80@exadv11.host.magwien.gv.at
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

I am configuring streaming replication with hot standby
with PostgreSQL 9.1.3 on RHEL 6 (kernel 2.6.32-220.el6.x86_64).
PostgreSQL was compiled from source.

It works fine, except that starting the standby took for ever:
it took the system more than 80 minutes to replay 48 WAL files
and connect to the primary.

Can anybody think of an explanation why it takes that long?

This is decent hardware: 24 cores of AMD Opteron 6174, 128 GB RAM,
NetApp SAN attached with 8 GBit Fibrechannel (ext4 file system).
An identical system performed fine in performance tests.

Here is the log; I have edited it for readability:

2012-10-29 09:22:22.945 database system was interrupted; last known up
at 2012-10-26 01:11:59 CEST
2012-10-29 09:22:22.945 creating missing WAL directory
"pg_xlog/archive_status"
2012-10-29 09:22:22.947 entering standby mode
2012-10-29 09:22:23.434 restored log file "00000001000001D1000000C4"
from archive
2012-10-29 09:22:23.453 redo starts at 1D1/C4000020
2012-10-29 09:22:25.847 restored log file "00000001000001D1000000C5"
from archive
2012-10-29 09:22:27.457 restored log file "00000001000001D1000000C6"
from archive
2012-10-29 09:22:28.946 restored log file "00000001000001D1000000C7"
from archive
2012-10-29 09:22:30.421 restored log file "00000001000001D1000000C8"
from archive
2012-10-29 09:22:31.243 restored log file "00000001000001D1000000C9"
from archive
2012-10-29 09:22:32.194 restored log file "00000001000001D1000000CA"
from archive
2012-10-29 09:22:33.169 restored log file "00000001000001D1000000CB"
from archive
2012-10-29 09:22:33.565 restored log file "00000001000001D1000000CC"
from archive
2012-10-29 09:23:35.451 restored log file "00000001000001D1000000CD"
from archive

Everything is nice until here.
Replaying this WAL file suddenly takes 1.5 minutes instead
of mere seconds as before.

2012-10-29 09:24:54.761 restored log file "00000001000001D1000000CE"
from archive
2012-10-29 09:27:23.013 restartpoint starting: time
2012-10-29 09:28:12.200 restartpoint complete: wrote 242 buffers
(0.0%);
0 transaction log file(s) added, 0 removed, 0
recycled;
write=48.987 s, sync=0.185 s, total=49.184 s;
sync files=1096, longest=0.016 s, average=0.000
s
2012-10-29 09:28:12.206 recovery restart point at 1D1/CC618278
2012-10-29 09:28:31.226 restored log file "00000001000001D1000000CF"
from archive

Again there is a difference of 2.5 minutes
between these WAL files, only 50 seconds of
which were spent in the restartpoint.

From here on it continues in quite the same vein.
Some WAL files are restored in seconds, but some take
more than 4 minutes.

I'll skip to the end of the log:

2012-10-29 10:37:53.809 restored log file "00000001000001D1000000EF"
from archive
2012-10-29 10:38:53.194 restartpoint starting: time
2012-10-29 10:39:25.929 restartpoint complete: wrote 161 buffers
(0.0%);
0 transaction log file(s) added, 0 removed, 0
recycled;
write=32.661 s, sync=0.066 s, total=32.734 s;
sync files=251, longest=0.003 s, average=0.000
s
2012-10-29 10:39:25.929 recovery restart point at 1D1/ED95C728
2012-10-29 10:42:56.153 restored log file "00000001000001D1000000F0"
from archive
2012-10-29 10:43:53.062 restartpoint starting: time
2012-10-29 10:45:36.871 restored log file "00000001000001D1000000F1"
from archive
2012-10-29 10:45:39.832 restartpoint complete: wrote 594 buffers
(0.0%);
0 transaction log file(s) added, 0 removed, 0
recycled;
write=106.666 s, sync=0.093 s, total=106.769 s;
sync files=729, longest=0.004 s, average=0.000
s
2012-10-29 10:45:39.832 recovery restart point at 1D1/EF5D4340
2012-10-29 10:46:13.602 restored log file "00000001000001D1000000F2"
from archive
2012-10-29 10:47:38.396 restored log file "00000001000001D1000000F3"
from archive
2012-10-29 10:47:38.962 streaming replication successfully connected to
primary

I'd be happy if somebody could shed light on this.

Yours,
Laurenz Albe

PS: Here is the configuration:

Responses

Re: Replaying 48 WAL files takes 80 minutes at 2012-10-29 13:19:17 from ktm@rice.edu
Re: Replaying 48 WAL files takes 80 minutes at 2012-10-29 14:29:01 from Alvaro Herrera
Re: Replaying 48 WAL files takes 80 minutes at 2012-10-29 18:27:06 from Jeff
Re: Replaying 48 WAL files takes 80 minutes at 2012-10-29 18:36:36 from Jeff Trout

Browse pgsql-performance by date

	From	Date	Subject
Next Message	ktm@rice.edu	2012-10-29 13:19:17	Re: Replaying 48 WAL files takes 80 minutes
Previous Message	Matheus de Oliveira	2012-10-29 12:30:26	Re: How to upgrade from 9.1 to 9.2 with replication?