Re: Intermittent hangs with 9.2

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: David Whittaker <dave(at)iradix(dot)com>
Cc: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Intermittent hangs with 9.2
Date: 2013-09-10 15:26:30
Message-ID: 522F3A26.20306@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


On 09/10/2013 11:04 AM, David Whittaker wrote:
>
> Hi All,
>
> I've been seeing a strange issue with our Postgres install for about a
> year now, and I was hoping someone might be able to help point me at
> the cause. At what seem like fairly random intervals Postgres will
> become unresponsive to the 3 application nodes it services. These
> periods tend to last for 10 - 15 minutes before everything rights
> itself and the system goes back to normal.
>
> During these periods the server will report a spike in the outbound
> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike
> in context switches / interrupts (normal peaks are around 2k/8k
> respectively, and during these periods they‘ve gone to 15k/22k), and a
> load average of 100+. CPU usage stays relatively low, but it’s all
> system time reported, user time goes to zero. It doesn‘t seem to be
> disk related since we’re running with a shared_buffers setting of 24G,
> which will fit just about our entire database into memory, and the IO
> transactions reported by the server, as well as the disk reads
> reported by Postgres stay consistently low.
>
> We‘ve recently started tracking how long statements take to execute,
> and we’re seeing some really odd numbers. A simple delete by primary
> key, for example, from a table that contains about 280,000 rows,
> reportedly took 18h59m46.900s. An update by primary key in that same
> table was reported as 7d 17h 58m 30.415s. That table is frequently
> accessed, but obviously those numbers don't seem reasonable at all.
>
> Some other changes we've made to postgresql.conf:
>
> synchronous_commit = off
>
> maintenance_work_mem = 1GB
> wal_level = hot_standby
> wal_buffers = 16MB
>
> max_wal_senders = 10
>
> wal_keep_segments = 5000
>
> checkpoint_segments = 128
>
> checkpoint_timeout = 30min
>
> checkpoint_completion_target = 0.9
>
> max_connections = 500
>
> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB
> of RAM, running Cent OS 6.3.
>
> So far we‘ve tried disabling Transparent Huge Pages after I found a
> number of resources online that indicated similar interrupt/context
> switch issues, but it hasn’t resolve the problem. I managed to catch
> it happening once and run a perf which showed:
>
> |
> + 41.40% 48154 postmaster 0x347ba9 f 0x347ba9
> + 9.55% 10956 postmaster 0x2dc820 f set_config_option
> + 8.64% 9946 postmaster 0x5a3d4 f writeListPage
> + 5.75% 6609 postmaster 0x5a2b0 f ginHeapTupleFastCollect
> + 2.68% 3084 postmaster 0x192483 f build_implied_join_equality
> + 2.61% 2990 postmaster 0x187a55 f build_paths_for_OR
> + 1.86% 2131 postmaster 0x794aa f get_collation_oid
> + 1.56% 1822 postmaster 0x5a67e f ginHeapTupleFastInsert
> + 1.53% 1766 postmaster 0x1929bc f distribute_qual_to_rels
> + 1.33% 1558 postmaster 0x249671 f cmp_numerics|
>
> I‘m not sure what 0x347ba9 represents, or why it’s an address rather
> than a method name.
>
> That's about the sum of it. Any help would be greatly appreciated and
> if you want any more information about our setup, please feel free to ask.
>
>

I have seen cases like this with very high shared_buffers settings.

24Gb for shared_buffers is quite high, especially on a 48Gb box. What
happens if you dial that back to, say, 12Gb?

cheers

andrew

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message ktm@rice.edu 2013-09-10 15:33:48 Re: Intermittent hangs with 9.2
Previous Message David Whittaker 2013-09-10 15:04:21 Intermittent hangs with 9.2