Re: Server hanging with high CPU wait -- Checkpoint issue?

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: George Woodring <george(dot)woodring(at)iglass(dot)net>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Server hanging with high CPU wait -- Checkpoint issue?
Date: 2014-02-03 20:35:25
Message-ID: CAMkU=1yG12yQEQhFU4eXUb2E1AwN5M31ZN5UWatWVBsvFSes4Q@mail.gmail.com
Lists: pgsql-general

On Mon, Feb 3, 2014 at 11:16 AM, George Woodring <george(dot)woodring(at)iglass(dot)net> wrote:

> I am running 9.3.2 on CentOS 6.5
>
> We have a new server we are migrating to that seems to lock up for up to
> 30sec at a time with all of the processes in wait. I am seeing a weird
> output with the checkpoint logs and am wondering if I need to adjust
> something with the checkpointer config.
>
> My checkpoint settings are:
>
> checkpoint_completion_target = 0.9
> checkpoint_timeout = 12min
> checkpoint_segments = 50 # in logfile segments, min 1, 16MB each
> checkpoint_warning = 300s # 0 disables
>
> When the box is running well, the checkpoint logs look like the following:
> checkpoints start every 12 minutes and complete in about 10
> minutes.
>
> Feb 3 11:00:09 sponge07b postgres[2754]: [2651-1] LOG: checkpoint
> starting: time
> Feb 3 11:10:57 sponge07b postgres[2754]: [2652-1] LOG: checkpoint
> complete: wrote 32352 buffers (6.2%); 0 transaction log file(s) added, 0
> removed, 14 recycled; write=647.749 s, sync=0.110 s, total=647.908 s; sync
> files=424, longest=0.017 s, average=0.000 s
>
> ...

> When we are having the issue, as we are just now, the checkpoint logs look
> like the following: checkpoints start every 12 minutes, but each checkpoint
> is written after 2-3 minutes.
>
> Feb 3 12:24:09 sponge07b postgres[2754]: [2665-1] LOG: checkpoint
> starting: time
> Feb 3 12:27:43 sponge07b postgres[2754]: [2666-1] LOG: checkpoint
> complete: wrote 10127 buffers (1.9%); 0 transaction log file(s) added, 4
> removed, 28 recycled; write=197.510 s, sync=15.975 s, total=213.746 s; sync
> files=440, longest=9.089 s, average=0.022 s
>
>
Probably what is happening here is that the backends or the bgwriter are
writing out dirty buffers (because they need those buffers for other
purposes) that have been marked for the checkpointer to write. The
checkpointer then thinks it is falling behind, because it does not realize
that other processes are clearing its flags, so it finishes early.
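
To put rough numbers on "finishes early": with checkpoint_timeout = 12min and checkpoint_completion_target = 0.9, the write phase is paced to take about 648 s, which matches the healthy log (write=647.749 s) but not the problem one (write=197.510 s). The sketch below is only an illustration of that arithmetic. The function names and the 0.5 cutoff are hypothetical, not anything PostgreSQL provides, and the log line is assumed to have been re-joined onto a single line.

```python
import re

# Matches the timing part of a 9.3-style "checkpoint complete" message.
TIMES_RE = re.compile(
    r"write=(\d+\.\d+) s, sync=(\d+\.\d+) s, total=(\d+\.\d+) s"
)


def parse_checkpoint_times(line):
    """Return write/sync/total seconds from one log line, or None."""
    m = TIMES_RE.search(line)
    if m is None:
        return None
    write, sync, total = (float(x) for x in m.groups())
    return {"write": write, "sync": sync, "total": total}


def expected_write_seconds(checkpoint_timeout_s, completion_target):
    """Write-phase duration a timed checkpoint is paced toward."""
    return checkpoint_timeout_s * completion_target


def finished_early(write_s, expected_s, cutoff=0.5):
    """Hypothetical heuristic: write phase used less than `cutoff` of its budget."""
    return write_s / expected_s < cutoff


if __name__ == "__main__":
    expected = expected_write_seconds(12 * 60, 0.9)
    for label, write_s in (("healthy", 647.749), ("problem", 197.510)):
        frac = write_s / expected
        print(f"{label}: {frac:.2f} of budget, early={finished_early(write_s, expected)}")
    # healthy: 1.00 of budget, early=False
    # problem: 0.30 of budget, early=True
```

A low fraction only shows that the checkpointer stopped pacing itself. It does not say which process wrote the buffers.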

This estimation problem perhaps can be fixed in a future version of
PostgreSQL, for example by including in the fsync queue information about
how many buffers flagged for the checkpointer have been written by others,
or by adding a bit to the buffer headers for buffers that were flagged for
the checkpointer but were written by someone else first.

However, this would probably not solve the true problem, as I think the
foreshortened checkpoint is probably just a symptom and not the cause. The
most likely cause is that you have a spike in buffer dirtying, which both
causes congestion in your IO system (which actually causes the freezes) and
also causes the bgwriter to write out buffers flagged for the checkpointer
which in turn causes the checkpointer to finish early. It is possible that
fixing the estimation problem would give the IO more time to un-congest
before the fsyncs start landing, but I am not optimistic about that being
very effective.

But all of that is for PostgreSQL's hackers to consider. From the
perspective of a PostgreSQL user, I'd ask about the difference in RAM
between the old and new server, and what the settings are
for /proc/sys/vm/dirty_background_ratio and
/proc/sys/vm/dirty_background_bytes. Probably your new server is allowing
a lot more dirty data to build up in the kernel's buffers, which then
causes serious congestion when those buffers need to get written for real.
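
One way to compare the two servers is to read both settings and estimate how much dirty data the kernel tolerates before background writeback starts. This is a minimal sketch under stated assumptions: the helper names are made up for illustration, the 64 GiB figure is a placeholder rather than the RAM of either server, and total RAM is used as an approximation because the kernel applies the ratio to available (free plus reclaimable) memory. A nonzero dirty_background_bytes overrides the ratio, and whichever of the two is not in use reads as 0.

```python
from pathlib import Path


def read_vm_setting(name, base="/proc/sys/vm"):
    """Read one integer sysctl such as dirty_background_ratio (Linux only)."""
    return int(Path(base, name).read_text().strip())


def background_threshold_bytes(ratio_pct, bytes_setting, mem_bytes):
    """Approximate dirty-data level at which background writeback begins.

    mem_bytes stands in for the kernel's "available memory", so the result
    is an upper-bound estimate, not an exact figure.
    """
    if bytes_setting > 0:
        # An explicit byte limit takes precedence over the percentage.
        return bytes_setting
    return mem_bytes * ratio_pct // 100


if __name__ == "__main__":
    # Placeholder values; on a real server use, for example,
    # read_vm_setting("dirty_background_ratio") and
    # read_vm_setting("dirty_background_bytes").
    mem = 64 * 2**30
    limit = background_threshold_bytes(10, 0, mem)
    print(f"background writeback starts near {limit / 2**30:.1f} GiB of dirty data")
```

Running the same calculation with each server's RAM and settings shows how much more dirty data the new box can accumulate before writeback begins.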

Cheers,

Jeff
