Re: BUG #9635: Wal sender process is using 100% CPU

From: Jamie Koceniak <jkoceniak(at)mediamath(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #9635: Wal sender process is using 100% CPU
Date: 2014-05-05 19:37:26
Message-ID: 84CE67AB46ACDA41B18094177DD0853D212241@ORD2MBX04C.mex05.mlsrvr.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Heikki,

I performed an strace on one of the pids that was consuming 100% of the cpu.
This is just a portion of the log.
I also trimmed out some data provided in the quotes in the read and sends, not sure if you need it.
A majority of the calls in the trace log were recvfrom() and sendto().
EAGAIN (Resource temporarily unavailable) came up time and time again.

# strace log during 100% cpu utilization on the master

getppid() = 5474
recvfrom(10, 0x7f5dgac1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(7, ""..., 131072) = 131072
sendto(10, ""..., 3178, 0, NULL, 0) = 3178
sendto(10, ""..., 5381, 0, NULL, 0) = 5381
getppid() = 5474
recvfrom(10, 0x7f5dgac1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(7, ""..., 131072) = 131072
sendto(10, ""..., 3338, 0, NULL, 0) = 3338
getppid() = 5474
recvfrom(10, " ", 5, 0, NULL, NULL) = 5
recvfrom(10, "", 32, 0, NULL, NULL) = 32
recvfrom(10, "", 5, 0, NULL, NULL) = 5
recvfrom(10, ""..., 48, 0, NULL, NULL) = 48
fcntl(10, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(10, F_SETFL, O_RDWR) = 0
fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
recvfrom(10, 0x7f5dgac1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
select(11, [3 10], [10], NULL, {1, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
write(6, "\0", 1) = 1
rt_sigreturn(0x6) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
getppid() = 5474
recvfrom(10, 5, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(10, ""..., 2715, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable)
read(3, 0x7ffl961b0780, 16) = -1 EAGAIN (Resource temporarily unavailable)
select(11, [3 10], [10], NULL, {1, 0}) = ? ERESTARTNOHAND (To be restarted)
--- SIGUSR1 (User defined signal 1) @ 0 (0) ---
write(6, "\0", 1) = 1
rt_sigreturn(0x6) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
getppid() = 5474

I currently don't have perf installed.
I can get that installed and send results if you need it.

Thanks!

-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas(at)vmware(dot)com]
Sent: Tuesday, March 25, 2014 4:44 AM
To: Jamie Koceniak
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #9635: Wal sender process is using 100% CPU

On 03/19/2014 07:13 PM, jkoceniak(at)mediamath(dot)com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 9635
> Logged by: Jamie Koceniak
> Email address: jkoceniak(at)mediamath(dot)com
> PostgreSQL version: 9.1.9
> Operating system: x86_64-unknown-linux-gnu (Debian 4.7.2-5) 64-bit
> Description:
>
> Periodically throughout the day, we keep seeing the wal sender process
> utilize 100% of the CPU. We began noticing this after we added 2 new
> slave servers, going from 2 to 4 slaves. See top results and I also
> included our wal settings. Thanks!

Hmm. There was a bug that caused walsender to go into a busy-loop, when the standby disconnected. But that was fixed in 9.1.4 already.

Could you attach to one of the processes with gdb and see what functions are being called? Or strace, or "perf top" or something.

- Heikki

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alvaro Herrera 2014-05-05 19:56:33 Re: BUG #10189: Limit in 9.3.4 no longer works when ordering using a composite multi-type index
Previous Message Tom Lane 2014-05-05 18:56:27 Re: BUG #10189: Limit in 9.3.4 no longer works when ordering using a composite multi-type index