From: valgog(at)gmail(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
Date: 2012-08-15 10:10:42
Message-ID: E1T1aYk-0007Jk-3h@wrigleys.postgresql.org
Lists: pgsql-bugs
The following bug has been logged on the website:
Bug reference: 7494
Logged by: Valentine Gogichashvili
Email address: valgog(at)gmail(dot)com
PostgreSQL version: 9.0.7
Operating system: Linux version 2.6.32-5-amd64 (Debian 2.6.32-41)
Description:
We are experiencing strange(?) behavior on the replication slave machines.
The master machine has a very heavy update load, with many processes
updating lots of data; it generates up to 30 GB of WAL files per hour.
Normally the slave machines have no trouble replaying this amount of WAL
on time and keeping up with the master. But at some moments a slave starts
"hanging", with the WAL replay process at 100% CPU and only 3% IOWait,
needing up to 30 seconds to process one WAL file. Once this tipping point
is reached, a huge WAL replication lag builds up quite fast. That in turn
overfills the XLOG directory on the slave machine, because the WAL
receiver keeps writing the WAL files it gets via streaming replication
into that directory (which in many cases is a separate disk partition of
quite limited size).
We also noticed that reducing the shared_buffers parameter on the slave
machines from our normal 20-32 GB down to 2 GB increases the speed of WAL
replay dramatically. After restarting a slave with a much lower
shared_buffers value, replay becomes up to 10-20 times faster.
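For reference, the workaround amounts to a one-line change in the slave's postgresql.conf (values illustrative; shared_buffers only takes effect after a server restart):

```
# postgresql.conf on the slave machine (illustrative)
shared_buffers = 2GB    # lowered from our usual 20-32GB; requires restart
```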
The attached graph shows a typical WAL replication delay curve for one of
the slaves. The small delay peaks (up to 6 GB) during the night are caused
by long-running transactions on the slave, which pause WAL replay to
prevent replication conflicts. The last, big peaks sometimes also start
with such waiting for a long-running transaction on the slave, but then
they keep growing as described above.
I know that there is only one process replaying data that was generated by
many backends on the master machine. But why does replay performance
depend so much on the shared_buffers parameter, and can it be optimized?
With best regards,
Valentine Gogichashvili