| From: | Jaco Engelbrecht <jengelbrecht(at)atlassian(dot)com> |
|---|---|
| To: | Bill Moran <wmoran(at)potentialtech(dot)com> |
| Cc: | Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Christopher Nielsen <cnielsen(at)atlassian(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
| Subject: | Re: Spurious Stalls |
| Date: | 2014-06-13 23:26:39 |
| Message-ID: | CAEorrh-rcSZhwd-yj=V90WWScTZSmh1SoDKXTQwoX4qOHxYL7w@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-general |
Hi Bill,
On 13 June 2014 20:35, Bill Moran <wmoran(at)potentialtech(dot)com> wrote:
> On Fri, 13 Jun 2014 20:02:01 +0100 Jaco Engelbrecht <jengelbrecht(at)atlassian(dot)com> wrote:
>>
>> This coincides with a checkpoint:
>
> There's a lot going on here (as always with a busy server) but I suspected
> a checkpoint problem earlier, and this statement and your additional
> information makes me suspect even harder.
>
> In your earlier email with the .conf file, I seem to remember that
> you had checkpoint_segments set to 256 and checkpoint_timeout set to
> 30m. It's obvious from reading this thread that you've already put
> quite a bit of effort into resolving this. My question: have you
> tried _lowering_ the checkpoint settings? If we assume that the
> stall is related to checkpoint, and I'm remembering correctly on the
> settings, then PostgreSQL might have as much as 4G of wal logs to
> grind through to complete a checkpoint. While not huge, if that's
> trying to complete at the same time a lot of other work is going on,
> it could cause stalls. If you lower the chckpoint_segments and
> checkpoint_timeout, it will cause _more_ disk activity overall, but
> it will be spread out more. Whether or not this helps with your
> particular situation is dependent on whether your incidents are
> caused by a spike in activity (in which case it might help) or
> a cumulative effect of a lot of activity (in which case it will
> probably make the situation worse).
Thanks, we're going to look into that.
I checked our revision history to see if we changed any checkpoint
settings over the last year and we have not, however what I did notice
was that a few days before we first experienced this issue we
increased the wal_keep_segments from 256 to 1024 (and then later
further to 1536) in order to keep enough WAL records around for our
backups.
Sure enough, I just found a post at
http://www.postgresql.org/message-id/CAPVp=gbKVbNr1zQM_LKauNY-U1PHB++y=Xq26K-dXdDsffv_PQ@mail.gmail.com
describing a similar issue related to having wal_keep_segments set to
1024 (with a much lower checkpoint_segments set - 32) but no
resolution on the list.
Any thoughts on the wal_keep_segments we have set to 1024 currently?
> Another thing that I may be misremembering from from your earlier
> email: did you say that the load on the database was mostly write
> (or am I misremembering that you said the OS graphs were showing
> mostly write?) The reason I'm asking is that we've seen problems
> like you describe when trying to implement a high volume queue
> in PostgreSQL: the continuous INSERT/SELECT/DELETE grind on the
> single queue table was just more than PostgreSQL could keep up
> with. We moved that one portion of the application to Redis and
> everything else just fell in line. I'm stretching a bit to suppose
> that you have a similar problem, but it's another data point for
> you to consider.
Yes, mostly writes. We already use Redis for some aspects of the
site, but we'll look into what else we could move there.
Jaco
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Steve Kehlet | 2014-06-14 01:53:24 | Re: Spurious Stalls |
| Previous Message | Si Chen | 2014-06-13 21:54:19 | Re: what does pg_activity mean when the database is stuck? |