Quick Links

Re: BUG #9635: Wal sender process is using 100% CPU

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Jamie Koceniak <jkoceniak(at)mediamath(dot)com>
Cc:	"pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject:	Re: BUG #9635: Wal sender process is using 100% CPU
Date:	2014-05-06 17:32:35
Message-ID:	53691CB3.3020908@vmware.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On 05/06/2014 07:44 PM, Jamie Koceniak wrote:
> Hi Heikki,
>
> Attached is a trace route for one of the pids that was running at 100% cpu.
> Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until the wal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in our application.
>
> Here is perf top results:
> 39.49% libz.so.1.2.7 [.] 0x000000000000301e
> 32.97% postgres [.] 0x00000000000cf150
> 6.50% postgres [.] nocachegetattr
> 2.63% postgres [.] btint4cmp
> 1.64% postgres [.] _start
> 1.40% [unknown] [.] 0x00007f9dfca6a34b
> 1.06% libc-2.13.so [.] 0x000000000012934b
> 0.94% postgres [.] slot_getattr
> 0.85% libcrypto.so.1.0.0 [.] 0x000000000009f48f
> 0.69% postgres [.] ExecProject
> 0.67% libz.so.1.2.7 [.] adler32
> 0.66% postgres [.] GetMemoryChunkSpace
> 0.58% [kernel] [k] copy_user_generic_string
> 0.45% postgres [.] heap_fill_tuple
> 0.38% postgres [.] execTuplesMatch
> 0.27% postgres [.] FunctionCall2Coll
> 0.27% postgres [.] ExecProcNode
> 0.25% postgres [.] heap_tuple_untoast_attr
> 0.22% postgres [.] MemoryContextAlloc
> 0.21% postgres [.] heap_form_minimal_tuple
> 0.21% postgres [.] pfree
> 0.19% postgres [.] heap_compute_data_size

Hmm, these results seem to include all processes, not just the WAL sender

> Here is some data using perf record and perf report:
>
> + 33.93% 193019 postgres postgres [.] 0x0000000000369e27
> + 30.06% 173825 postgres libz.so.1.2.7 [.] 0x00000000000036c0
> + 10.21% 60618 postgres [unknown] [.] 0x00007f9dfc2fd529
> + 4.81% 27275 postgres postgres [.] nocachegetattr
> + 2.55% 14443 postgres postgres [.] btint4cmp
> + 1.22% 6916 postgres postgres [.] ExecProject
> + 1.20% 6837 postgres postgres [.] _start
> + 1.10% 6495 postgres libc-2.13.so [.] 0x000000000012b8e9
> + 0.84% 4760 postgres postgres [.] heap_fill_tuple
> + 0.74% 4265 postgres [kernel.kallsyms] [k] copy_user_generic_string
> + 0.71% 4001 postgres postgres [.] GetMemoryChunkSpace
> + 0.66% 3881 postgres libcrypto.so.1.0.0 [.] 0x0000000000081ec7
> + 0.63% 3673 postgres libz.so.1.2.7 [.] adler32
> + 0.43% 2448 postgres postgres [.] slot_getattr
> + 0.40% 2277 postgres postgres [.] ExecProcNode
> + 0.40% 2250 postgres postgres [.] heap_form_minimal_tuple
> + 0.37% 2078 postgres postgres [.] heap_compute_data_size
> + 0.31% 1739 postgres postgres [.] ExecScan
> + 0.27% 1779 postgres [kernel.kallsyms] [k] page_fault
> + 0.24% 1374 postgres postgres [.] LogicalTapeWrite
> + 0.24% 4206 swapper [kernel.kallsyms] [k] intel_idle
> + 0.21% 1206 postgres postgres [.] slot_getsomeattrs

Same here.

Looks reasonable to me.

So, the strace output repeats this, with small variations in the number
of bytes written in each sendto() call:

recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
read(7, ..., 131072) = 131072
sendto(10, ..., 3690, 0, NULL, 0) = 3690
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3653, 0, NULL, 0) = 3653
sendto(10, ..., 3637, 0, NULL, 0) = 3637
sendto(10, ..., 3621, 0, NULL, 0) = 3621
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 53, 0, NULL, 0) = 53
getppid() = 5474

The read() reads 128kB of WAL from the file in pg_xlog, and it is then
sent to the client. Now, it's pretty surprising that the number of bytes
written with sendto() doesn't match the number of bytes read from the
file. Unless, you're using SSL with compression. That would also explain
the high amount of CPU time spent in libz in the perf output.

Try disabling SSL compression, with sslcompression=0 option in
primary_conninfo, or disable SSL altogether if the network is
trustworthy enough for that.

- Heikki

In response to

Re: BUG #9635: Wal sender process is using 100% CPU at 2014-05-06 16:44:42 from Jamie Koceniak

Responses

Re: BUG #9635: Wal sender process is using 100% CPU at 2014-05-06 23:09:18 from Jamie Koceniak

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Jeff Davis	2014-05-06 18:01:53	regression failure on master with --disable-integer-datetimes
Previous Message	Jamie Koceniak	2014-05-06 16:44:42	Re: BUG #9635: Wal sender process is using 100% CPU