Re: Server hanging with high CPU wait -- Checkpoint issue?

From: George Woodring <george(dot)woodring(at)iglass(dot)net>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Server hanging with high CPU wait -- Checkpoint issue?
Date: 2014-02-04 13:54:57
Message-ID: CACi+J=S=dQJXCZA8A_V+Lz20jCCSN-2pWA6rhoVmzqZrrhhyuA@mail.gmail.com
Lists: pgsql-general

Thank you for the response.

Our new server went down in memory from 20GB to 16GB. Our old server has
100 databases in the cluster, and since we will be splitting them up across
multiple servers, we thought the decrease was acceptable.

dirty_background_ratio is 10 on the new box and 1 on the old. That might
be something we can look at.
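For reference, a quick way to compare the two boxes is to read the values
straight out of /proc; a sketch (the sysctl write needs root, and the value
shown is just the old server's setting from this thread):

```shell
# Current kernel writeback thresholds (percent of RAM that may be
# dirty before background writeback starts / before writers block).
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

# To try the old server's more aggressive setting (run as root;
# sysctl -w does not survive a reboot):
#   sysctl -w vm.dirty_background_ratio=1
# To persist it, add this line to /etc/sysctl.conf:
#   vm.dirty_background_ratio = 1
```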

If the bottleneck is IO, then it would have to be memory access. When our
system gets into this state, the writer and wal writer both go into wait,
iostat drops to 0, and our network traffic drops to 0. After the freeze,
iostat increases and it catches up very easily.
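When the next freeze hits, watching the kernel's dirty-page counters may
confirm or rule out writeback as the cause; a sketch that samples
/proc/meminfo a few times (five one-second samples here, purely illustrative):

```shell
# Sample how much dirty data the kernel is holding; a large Dirty
# value that drains right as iostat recovers would point at kernel
# writeback as the source of the stall.
for i in 1 2 3 4 5; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done
```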

Thanks again for the help. I will update if we find something useful.

George

iGLASS Networks
www.iglass.net

On Mon, Feb 3, 2014 at 3:35 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Mon, Feb 3, 2014 at 11:16 AM, George Woodring <
> george(dot)woodring(at)iglass(dot)net> wrote:
>
>> I am running 9.3.2 on CentOS 6.5
>>
>> We have a new server we are migrating to that seems to lock up for up to
>> 30sec at a time with all of the processes in wait. I am seeing weird
>> output in the checkpoint logs and am wondering if I need to adjust
>> something in the checkpointer config.
>>
>> My checkpoint settings are:
>>
>> checkpoint_completion_target = 0.9
>> checkpoint_timeout = 12min
>> checkpoint_segments = 50 # in logfile segments, min 1, 16MB each
>> checkpoint_warning = 300s # 0 disables
>>
>> When the box is running well, the checkpoint logs look like the
>> following: checkpoints starting every 12 minutes and completing in
>> about 10 minutes.
>>
>> Feb 3 11:00:09 sponge07b postgres[2754]: [2651-1] LOG: checkpoint
>> starting: time
>> Feb 3 11:10:57 sponge07b postgres[2754]: [2652-1] LOG: checkpoint
>> complete: wrote 32352 buffers (6.2%); 0 transaction log file(s) added, 0
>> removed, 14 recycled; write=647.749 s, sync=0.110 s, total=647.908 s; sync
>> files=424, longest=0.017 s, average=0.000 s
>>
>> ...
>
>
>> When we are having the issue, as just now, the checkpoint logs look
>> like the following: checkpoints still start every 12 minutes, but the
>> checkpoint is written after only 2-3 minutes.
>>
>> Feb 3 12:24:09 sponge07b postgres[2754]: [2665-1] LOG: checkpoint
>> starting: time
>> Feb 3 12:27:43 sponge07b postgres[2754]: [2666-1] LOG: checkpoint
>> complete: wrote 10127 buffers (1.9%); 0 transaction log file(s) added, 4
>> removed, 28 recycled; write=197.510 s, sync=15.975 s, total=213.746 s; sync
>> files=440, longest=9.089 s, average=0.022 s
>>
>>
> Probably what is happening here is that the backends or the bgwriter are
> writing out dirty buffers (because they need those buffers for other
> purposes) that have been marked for the checkpointer to write. The
> checkpointer then thinks it is falling behind, because it does not realize
> that other processes are clearing its flags, so it finishes early.
>
> This estimation problem perhaps can be fixed in a future version of
> PostgreSQL, for example by including in the fsync queue information about
> how many buffers flagged for the checkpointer have been written by others,
> or by adding a bit to the buffer headers for buffers that were flagged for
> the checkpointer but were written by someone else first.
>
> However, this would probably not solve the true problem, as I think the
> foreshortened checkpoint is probably just a symptom and not the cause. The
> most likely cause is that you have a spike in buffer dirtying, which both
> causes congestion in your IO system (which actually causes the freezes) and
> also causes the bgwriter to write out buffers flagged for the checkpointer
> which in turn causes the checkpointer to finish early. It is possible that
> fixing the estimation problem would give the IO more time to un-congest
> before the fsyncs start landing, but I am not optimistic about that being
> very effective.
>
> But all of that is for PostgreSQL's hackers to consider. From the
> perspective of a PostgreSQL user, I'd ask about the difference in RAM
> between the old and new server, and what the settings are
> for /proc/sys/vm/dirty_background_ratio and
> /proc/sys/vm/dirty_background_bytes. Probably your new server is allowing
> a lot more dirty data to build up in the kernel's buffers, which then
> causes serious congestion when those buffers need to get written for real.
>
> Cheers,
>
> Jeff
>
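Jeff's point about other processes writing out buffers flagged for the
checkpointer can be watched directly in pg_stat_bgwriter; a sketch
(connection settings are whatever your environment provides, and it just
prints a notice if no server is reachable):

```shell
# Break down buffer writes by source: checkpointer, bgwriter
# ("clean"), or ordinary backends. A growing buffers_backend count
# alongside foreshortened checkpoints fits the explanation above.
psql -c "SELECT checkpoints_timed, checkpoints_req,
                buffers_checkpoint, buffers_clean, buffers_backend
           FROM pg_stat_bgwriter;" \
  || echo "no PostgreSQL server reachable"
```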
