Re: Spurious Stalls

From: Jaco Engelbrecht <jengelbrecht(at)atlassian(dot)com>
To: Bill Moran <wmoran(at)potentialtech(dot)com>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Christopher Nielsen <cnielsen(at)atlassian(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Spurious Stalls
Date: 2014-06-13 23:26:39
Message-ID: CAEorrh-rcSZhwd-yj=V90WWScTZSmh1SoDKXTQwoX4qOHxYL7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Bill,

On 13 June 2014 20:35, Bill Moran <wmoran(at)potentialtech(dot)com> wrote:
> On Fri, 13 Jun 2014 20:02:01 +0100 Jaco Engelbrecht <jengelbrecht(at)atlassian(dot)com> wrote:
>>
>> This coincides with a checkpoint:
>
> There's a lot going on here (as always with a busy server) but I suspected
> a checkpoint problem earlier, and this statement and your additional
> information makes me suspect even harder.
>
> In your earlier email with the .conf file, I seem to remember that
> you had checkpoint_segments set to 256 and checkpoint_timeout set to
> 30m. It's obvious from reading this thread that you've already put
> quite a bit of effort into resolving this. My question: have you
> tried _lowering_ the checkpoint settings? If we assume that the
> stall is related to checkpoint, and I'm remembering correctly on the
> settings, then PostgreSQL might have as much as 4G of wal logs to
> grind through to complete a checkpoint. While not huge, if that's
> trying to complete at the same time a lot of other work is going on,
> it could cause stalls. If you lower the chckpoint_segments and
> checkpoint_timeout, it will cause _more_ disk activity overall, but
> it will be spread out more. Whether or not this helps with your
> particular situation is dependent on whether your incidents are
> caused by a spike in activity (in which case it might help) or
> a cumulative effect of a lot of activity (in which case it will
> probably make the situation worse).

Thanks, we're going to look into that.

I checked our revision history to see if we changed any checkpoint
settings over the last year and we have not, however what I did notice
was that a few days before we first experienced this issue we
increased the wal_keep_segments from 256 to 1024 (and then later
further to 1536) in order to keep enough WAL records around for our
backups.

Sure enough, I just found a post at
http://www.postgresql.org/message-id/CAPVp=gbKVbNr1zQM_LKauNY-U1PHB++y=Xq26K-dXdDsffv_PQ@mail.gmail.com
describing a similar issue related to having wal_keep_segments set to
1024 (with a much lower checkpoint_segments set - 32) but no
resolution on the list.

Any thoughts on the wal_keep_segments we have set to 1024 currently?

> Another thing that I may be misremembering from from your earlier
> email: did you say that the load on the database was mostly write
> (or am I misremembering that you said the OS graphs were showing
> mostly write?) The reason I'm asking is that we've seen problems
> like you describe when trying to implement a high volume queue
> in PostgreSQL: the continuous INSERT/SELECT/DELETE grind on the
> single queue table was just more than PostgreSQL could keep up
> with. We moved that one portion of the application to Redis and
> everything else just fell in line. I'm stretching a bit to suppose
> that you have a similar problem, but it's another data point for
> you to consider.

Yes, mostly writes. We already use Redis for some aspects of the
site, but we'll look into what else we could move there.

Jaco

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Steve Kehlet 2014-06-14 01:53:24 Re: Spurious Stalls
Previous Message Si Chen 2014-06-13 21:54:19 Re: what does pg_activity mean when the database is stuck?