Intermittent hangs with 9.2

From: David Whittaker <dave(at)iradix(dot)com>
To: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Intermittent hangs with 9.2
Date: 2013-09-10 18:05:41
Message-ID: CABXnLXQ7wROP2+2Nkfjt=0Mgnhws8WpmiP37HqgCyjmg3cAHHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi Ken,

On Tue, Sep 10, 2013 at 11:33 AM, ktm(at)rice(dot)edu <ktm(at)rice(dot)edu> wrote:

> On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
> > Hi All,
> >
> > I've been seeing a strange issue with our Postgres install for about a
> year
> > now, and I was hoping someone might be able to help point me at the
> cause.
> > At what seem like fairly random intervals Postgres will become
> unresponsive
> > to the 3 application nodes it services. These periods tend to last for
> 10 -
> > 15 minutes before everything rights itself and the system goes back to
> > normal.
> >
> > During these periods the server will report a spike in the outbound
> > bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> > context switches / interrupts (normal peaks are around 2k/8k
> respectively,
> > and during these periods they‘ve gone to 15k/22k), and a load average of
> > 100+. CPU usage stays relatively low, but it’s all system time reported,
> > user time goes to zero. It doesn‘t seem to be disk related since we’re
> > running with a shared_buffers setting of 24G, which will fit just about
> our
> > entire database into memory, and the IO transactions reported by the
> > server, as well as the disk reads reported by Postgres stay consistently
> > low.
> >
> > We‘ve recently started tracking how long statements take to execute, and
> > we’re seeing some really odd numbers. A simple delete by primary key, for
> > example, from a table that contains about 280,000 rows, reportedly took
> > 18h59m46.900s. An update by primary key in that same table was reported
> as
> > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously
> those
> > numbers don't seem reasonable at all.
> >
> > Some other changes we've made to postgresql.conf:
> >
> > synchronous_commit = off
> >
> > maintenance_work_mem = 1GB
> > wal_level = hot_standby
> > wal_buffers = 16MB
> >
> > max_wal_senders = 10
> >
> > wal_keep_segments = 5000
> >
> > checkpoint_segments = 128
> >
> > checkpoint_timeout = 30min
> >
> > checkpoint_completion_target = 0.9
> >
> > max_connections = 500
> >
> > The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
> > RAM, running Cent OS 6.3.
> >
> > So far we‘ve tried disabling Transparent Huge Pages after I found a
> number
> > of resources online that indicated similar interrupt/context switch
> issues,
> > but it hasn’t resolve the problem. I managed to catch it happening once
> and
> > run a perf which showed:
> >
> >
> > + 41.40% 48154 postmaster 0x347ba9 f 0x347ba9
> > + 9.55% 10956 postmaster 0x2dc820 f
> > set_config_option
> > + 8.64% 9946 postmaster 0x5a3d4 f writeListPage
> > + 5.75% 6609 postmaster 0x5a2b0 f
> > ginHeapTupleFastCollect
> > + 2.68% 3084 postmaster 0x192483 f
> > build_implied_join_equality
> > + 2.61% 2990 postmaster 0x187a55 f
> > build_paths_for_OR
> > + 1.86% 2131 postmaster 0x794aa f
> > get_collation_oid
> > + 1.56% 1822 postmaster 0x5a67e f
> > ginHeapTupleFastInsert
> > + 1.53% 1766 postmaster 0x1929bc f
> > distribute_qual_to_rels
> > + 1.33% 1558 postmaster 0x249671 f cmp_numerics
> >
> > I‘m not sure what 0x347ba9 represents, or why it’s an address rather
> than a
> > method name.
> >
> > That's about the sum of it. Any help would be greatly appreciated and if
> > you want any more information about our setup, please feel free to ask.
> >
> > Thanks,
> > Dave
>
> Hi Dave,
>
> A load average of 100+ means that you have that many processes waiting to
> run yet you only have 16 cpus. You really need to consider using a
> connection
> pooler like pgbouncer to keep your connection count in the 16-32 range.
>
>
That would make sense if the issues corresponded to increased load, but
they don't. I understand that the load spike is caused by waiting
processed, but it doesn't seem to correspond to a transaction spike. The
number of transactions per second appear to stay in-line with normal usage
when these issues occur. I do see an increase in postmaster processes when
it happens, but they don't seem to have entered a transaction yet. Coupled
with the fact that cpu usage is all system time, and the context switch /
interrupt spikes, I feel like something must be going on behind the scenes
leading to these problems. I'm just not sure what that something is.

> Regards,
> Ken
>

Browse pgsql-performance by date

  From Date Subject
Next Message David Whittaker 2013-09-10 18:06:26 Intermittent hangs with 9.2
Previous Message David Whittaker 2013-09-10 18:04:57 Intermittent hangs with 9.2