Re: Intermittent hangs with 9.2

From: David Whittaker <dave(at)iradix(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Intermittent hangs with 9.2
Date: 2013-09-12 20:06:22
Message-ID: CABXnLXT-HF7amXEuS_jTCKiYGfv+t-vodptOzQzjoWQZ9ohjMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi All,

We lowered shared_buffers to 8G and increased effective_cache_size
accordingly. So far, we haven't seen any issues since the adjustment. The
issues have come and gone in the past, so I'm not convinced it won't crop
up again, but I think the best course is to wait a week or so and see how
things work out before we make any other changes.

Thank you all for your help, and if the problem does reoccur, we'll look
into the other options suggested, like using a patched postmaster and
compiling for perf -g.

Thanks again, I really appreciate the feedback from everyone.

-Dave

On Wed, Sep 11, 2013 at 1:17 PM, Andres Freund <andres(at)2ndquadrant(dot)com>wrote:

> On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
> > > I've been seeing a strange issue with our Postgres install for about a
> year
> > > now, and I was hoping someone might be able to help point me at the
> cause.
> > > At what seem like fairly random intervals Postgres will become
> unresponsive
> > > to the 3 application nodes it services. These periods tend to last for
> 10 -
> > > 15 minutes before everything rights itself and the system goes back to
> > > normal.
> > >
> > > During these periods the server will report a spike in the outbound
> > > bandwidth (from about 1mbs to about 5mbs most recently), a huge spike
> in
> > > context switches / interrupts (normal peaks are around 2k/8k
> respectively,
> > > and during these periods they‘ve gone to 15k/22k), and a load average
> of
> > > 100+. CPU usage stays relatively low, but it’s all system time
> reported,
> > > user time goes to zero. It doesn‘t seem to be disk related since we’re
> > > running with a shared_buffers setting of 24G, which will fit just
> about our
> > > entire database into memory, and the IO transactions reported by the
> server,
> > > as well as the disk reads reported by Postgres stay consistently low.
> > >
> > > We‘ve recently started tracking how long statements take to execute,
> and
> > > we’re seeing some really odd numbers. A simple delete by primary key,
> for
> > > example, from a table that contains about 280,000 rows, reportedly took
> > > 18h59m46.900s. An update by primary key in that same table was
> reported as
> > > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously
> those
> > > numbers don't seem reasonable at all.
> > >
> > > Some other changes we've made to postgresql.conf:
> > >
> > > synchronous_commit = off
> > >
> > > maintenance_work_mem = 1GB
> > > wal_level = hot_standby
> > > wal_buffers = 16MB
> > >
> > > max_wal_senders = 10
> > >
> > > wal_keep_segments = 5000
> > >
> > > checkpoint_segments = 128
> > >
> > > checkpoint_timeout = 30min
> > >
> > > checkpoint_completion_target = 0.9
> > >
> > > max_connections = 500
> > >
> > > The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB
> of
> > > RAM, running Cent OS 6.3.
> > >
> > > So far we‘ve tried disabling Transparent Huge Pages after I found a
> number
> > > of resources online that indicated similar interrupt/context switch
> issues,
> > > but it hasn’t resolve the problem. I managed to catch it happening
> once and
> > > run a perf which showed:
> > >
> > > + 41.40% 48154 postmaster 0x347ba9 f 0x347ba9
> > > + 9.55% 10956 postmaster 0x2dc820 f set_config_option
> > > + 8.64% 9946 postmaster 0x5a3d4 f writeListPage
> > > + 5.75% 6609 postmaster 0x5a2b0 f
> > > ginHeapTupleFastCollect
> > > + 2.68% 3084 postmaster 0x192483 f
> > > build_implied_join_equality
> > > + 2.61% 2990 postmaster 0x187a55 f
> build_paths_for_OR
> > > + 1.86% 2131 postmaster 0x794aa f get_collation_oid
> > > + 1.56% 1822 postmaster 0x5a67e f
> ginHeapTupleFastInsert
> > > + 1.53% 1766 postmaster 0x1929bc f
> > > distribute_qual_to_rels
> > > + 1.33% 1558 postmaster 0x249671 f cmp_numerics
> > >
> > > I‘m not sure what 0x347ba9 represents, or why it’s an address rather
> than a
> > > method name.
>
> Try converting it to something more meaningful with "addr2line", that
> often has more sucess.
>
> > > That's about the sum of it. Any help would be greatly appreciated and
> if you
> > > want any more information about our setup, please feel free to ask.
>
> > Reducing shared buffers to around 2gb will probably make the problem go
> away
>
> That profile doesn't really look like one of the problem you are
> referring to would look like.
>
> Based on the profile I'd guess it's possible that you're seing problems
> with GIN's "fastupdate" mechanism.
> Try ALTER INDEX whatever SET (FASTUPDATE = OFF); VACUUM
> whatever's_table for all gin indexes.
>
> It's curious that set_config_option is so high in the profile... Any
> chance you could recompile postgres with -fno-omit-frame-pointers in
> CFLAGS? That would allow you to use perf -g. The performance price of
> that usually is below 1% for postgres.
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2013-09-13 14:52:16 Re: Intermittent hangs with 9.2
Previous Message Kevin Grittner 2013-09-12 18:46:20 Re: How clustering for scale out works in PostgreSQL