Intermittent hangs with 9.2

From: David Whittaker <dave(at)iradix(dot)com>
To: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Intermittent hangs with 9.2
Date: 2013-09-10 18:04:57
Message-ID: CABXnLXTah41DNzue_zpo_PGp4GhD3=n1Y4a7wCy-kMqxY55OEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi Andrew,

On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan <andrew(at)dunslane(dot)net>wrote:

>
> On 09/10/2013 11:04 AM, David Whittaker wrote:
>
>>
>> Hi All,
>>
>> I've been seeing a strange issue with our Postgres install for about a
>> year now, and I was hoping someone might be able to help point me at the
>> cause. At what seem like fairly random intervals Postgres will become
>> unresponsive to the 3 application nodes it services. These periods tend to
>> last for 10 - 15 minutes before everything rights itself and the system
>> goes back to normal.
>>
>> During these periods the server will report a spike in the outbound
>> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
>> context switches / interrupts (normal peaks are around 2k/8k respectively,
>> and during these periods they‘ve gone to 15k/22k), and a load average of
>> 100+. CPU usage stays relatively low, but it’s all system time reported,
>> user time goes to zero. It doesn‘t seem to be disk related since we’re
>> running with a shared_buffers setting of 24G, which will fit just about our
>> entire database into memory, and the IO transactions reported by the
>> server, as well as the disk reads reported by Postgres stay consistently
>> low.
>>
>> We‘ve recently started tracking how long statements take to execute, and
>> we’re seeing some really odd numbers. A simple delete by primary key, for
>> example, from a table that contains about 280,000 rows, reportedly took
>> 18h59m46.900s. An update by primary key in that same table was reported as
>> 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
>> numbers don't seem reasonable at all.
>>
>> Some other changes we've made to postgresql.conf:
>>
>> synchronous_commit = off
>>
>> maintenance_work_mem = 1GB
>> wal_level = hot_standby
>> wal_buffers = 16MB
>>
>> max_wal_senders = 10
>>
>> wal_keep_segments = 5000
>>
>> checkpoint_segments = 128
>>
>> checkpoint_timeout = 30min
>>
>> checkpoint_completion_target = 0.9
>>
>> max_connections = 500
>>
>> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
>> RAM, running Cent OS 6.3.
>>
>> So far we‘ve tried disabling Transparent Huge Pages after I found a
>> number of resources online that indicated similar interrupt/context switch
>> issues, but it hasn’t resolve the problem. I managed to catch it happening
>> once and run a perf which showed:
>>
>> |
>> + 41.40% 48154 postmaster 0x347ba9 f 0x347ba9
>> + 9.55% 10956 postmaster 0x2dc820 f set_config_option
>> + 8.64% 9946 postmaster 0x5a3d4 f writeListPage
>> + 5.75% 6609 postmaster 0x5a2b0 f
>> ginHeapTupleFastCollect
>> + 2.68% 3084 postmaster 0x192483 f
>> build_implied_join_equality
>> + 2.61% 2990 postmaster 0x187a55 f build_paths_for_OR
>> + 1.86% 2131 postmaster 0x794aa f get_collation_oid
>> + 1.56% 1822 postmaster 0x5a67e f
>> ginHeapTupleFastInsert
>> + 1.53% 1766 postmaster 0x1929bc f
>> distribute_qual_to_rels
>> + 1.33% 1558 postmaster 0x249671 f cmp_numerics|
>>
>> I‘m not sure what 0x347ba9 represents, or why it’s an address rather than
>> a method name.
>>
>> That's about the sum of it. Any help would be greatly appreciated and if
>> you want any more information about our setup, please feel free to ask.
>>
>>
>>
> I have seen cases like this with very high shared_buffers settings.
>
> 24Gb for shared_buffers is quite high, especially on a 48Gb box. What
> happens if you dial that back to, say, 12Gb?
>

I'd be willing to give it a try. I'd really like to understand what's
going on here though. Can you elaborate on that? Why would 24G of shared
buffers be too high in this case? The machine is devoted entirely to PG,
so having PG use half of the available RAM to cache data doesn't feel
unreasonable.

>
> cheers
>
> andrew
>
>

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message David Whittaker 2013-09-10 18:05:41 Intermittent hangs with 9.2
Previous Message Jeff Janes 2013-09-10 17:44:37 Re: Intermittent hangs with 9.2