Spurious Stalls

From: Christopher Nielsen <cnielsen(at)atlassian(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Spurious Stalls
Date: 2014-06-12 19:57:24
Message-ID: CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com
Lists: pgsql-general

Hi Group,

My team has been very happy using Postgres to host Bitbucket
<http://bitbucket.org/>. Thanks very much for all the community
contributions to the platform.

Lately, though, for about a week now, we have been experiencing periods of
stalling roughly once a day. When Postgres stalls, we have not been able to
recover without restarting the database, unfortunately.

This has brought our uptime down to 99.2%, which we'd like to avoid :(
We'd like to do a better job of keeping things running.

It would be great to get your input on this. Alternatively, if someone is
available as a consultant, that would be great too.

Here is some background on the issue. During these incidents, we have
observed the following symptoms (a diagnostic query we plan to run next
time is sketched after the list):

- Running queries do not return.
- The application sometimes can no longer get new connections.
- CPU load increases.
- There is no I/O wait.
- There is no swapping.
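
To get more visibility the next time this happens, we plan to snapshot
pg_stat_activity while the stall is in progress, along the lines of the
rough sketch below (this assumes the 9.3-era columns, including the
boolean "waiting" field):

    -- Rough sketch: what each non-idle backend is doing, oldest
    -- transactions first. "waiting" only covers heavyweight lock waits.
    SELECT pid,
           state,
           waiting,
           now() - xact_start  AS xact_age,
           now() - query_start AS query_age,
           left(query, 80)     AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY xact_age DESC NULLS LAST;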

Also, our database configuration is attached to this email as
postgresql.conf, for reference, along with a profile of our hardware and
tuning as pg_db_profile.txt.

While the database was unavailable, we also collected a lot of data.
Looking through it, a few things stand out as potentially problematic or
worth noting.

- Disk I/O appears to be almost all writes, with very little reading.
- In previous incidents with the same symptoms, we have seen Postgres
processes spending much of their time in s_lock (a lock check we intend to
run next time is sketched after this list).
- That info is attached to this email as well, as files named perf_*.
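
s_lock is spinlock contention, so it will not show up in pg_locks; still,
to rule out ordinary heavyweight-lock pile-ups during the next incident,
we intend to run something along these lines (a rough sketch that joins
each ungranted lock in pg_locks to the sessions holding a conflicting lock
on the same object):

    -- Rough sketch: blocked vs. blocking sessions. Spinlock (s_lock)
    -- contention will not appear here; only heavyweight locks do.
    SELECT blocked.pid                  AS blocked_pid,
           left(blocked_act.query, 60)  AS blocked_query,
           blocking.pid                 AS blocking_pid,
           left(blocking_act.query, 60) AS blocking_query,
           blocking.mode                AS blocking_mode
    FROM pg_locks blocked
    JOIN pg_stat_activity blocked_act  ON blocked_act.pid = blocked.pid
    JOIN pg_locks blocking
      ON  blocking.locktype      = blocked.locktype
      AND blocking.database      IS NOT DISTINCT FROM blocked.database
      AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
      AND blocking.page          IS NOT DISTINCT FROM blocked.page
      AND blocking.tuple         IS NOT DISTINCT FROM blocked.tuple
      AND blocking.virtualxid    IS NOT DISTINCT FROM blocked.virtualxid
      AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
      AND blocking.classid       IS NOT DISTINCT FROM blocked.classid
      AND blocking.objid         IS NOT DISTINCT FROM blocked.objid
      AND blocking.objsubid      IS NOT DISTINCT FROM blocked.objsubid
      AND blocking.pid          <> blocked.pid
    JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
    WHERE NOT blocked.granted
      AND blocking.granted;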

Additionally, monitoring graphs show the following performance profile.

*Problem*

As our monitoring graphs show, at 11:54 the DB stops returning rows.

Transactions also stop returning, causing the active transaction time to
trend sharply upward.

*Consequences of Problem*

Once transactions stop returning, we see connections pile up. Eventually
we reach the connection limit, and clients can no longer connect.
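
To confirm we are actually running into the connection limit, and to see
what the piled-up backends are doing, a quick count by state compared
against max_connections seems useful; roughly:

    -- Rough sketch: backends per state vs. the configured limit.
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;

    SHOW max_connections;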


CPU utilization increases to nearly 100% in user space and stays there
until the database is restarted.

*Events Before Problem*

This is likely the most useful part. As the time approaches 11:54, there
are periods of increased latency, along with a marked general increase in
write operations.
Lastly, about 10 minutes before the outage, Postgres writes a sustained
30 MB/s of temp files.

After investigating this, we found a query that was greatly exceeding
work_mem. We've since optimized it, and hopefully that will have a
positive effect on the above.

We may not know until the next issue happens, though.
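
In the meantime, we intend to keep an eye on temp file usage, roughly as
sketched below (pg_stat_database has carried temp_files/temp_bytes
counters since 9.2, and log_temp_files should flag the individual queries
that spill):

    -- Rough sketch: cumulative temp file usage per database since the
    -- last stats reset; fast-growing temp_bytes points at queries that
    -- spill past work_mem.
    SELECT datname,
           temp_files,
           pg_size_pretty(temp_bytes) AS temp_bytes,
           stats_reset
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;

    -- In postgresql.conf, to log every temp file larger than 10 MB:
    --   log_temp_files = 10240    # kB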

With a problem like this, I am not entirely sure how to proceed. I am
really looking forward to hearing your thoughts and opinions, if you can
share them.

Thanks very much,

-Chris

Attachment Content-Type Size
pg_db_profile.txt text/plain 11.2 KB
postgresql.conf application/octet-stream 2.7 KB
perf_example_vmstat application/octet-stream 1.8 KB
perf_example_dmesg application/octet-stream 66.4 KB
perf_example_ipcs application/octet-stream 837 bytes
perf_example_locks.csv text/csv 70 bytes
perf_example_pginfo application/octet-stream 59 bytes
perf_example_ps_auxfww application/octet-stream 146.7 KB
perf_example_iotop application/octet-stream 11.0 KB
perf_example_strace.47700 application/octet-stream 37 bytes
perf_example_backtrace.47700 application/octet-stream 419 bytes
perf_example_stack.47700 application/octet-stream 316 bytes
perf_example_status.47700 application/octet-stream 926 bytes
perf_example_strace.46462 application/octet-stream 2.9 MB
perf_example_syscall.47700 application/octet-stream 65 bytes
perf_example_backtrace.46462 application/octet-stream 1.7 KB
perf_example_stack.46462 application/octet-stream 40 bytes
perf_example_status.46462 application/octet-stream 925 bytes
perf_example_strace.29561 application/octet-stream 5.1 MB
perf_example_syscall.46462 application/octet-stream 8 bytes
perf_example_backtrace.29561 application/octet-stream 419 bytes
perf_example_stack.29561 application/octet-stream 316 bytes
perf_example_status.29561 application/octet-stream 927 bytes
perf_example_syscall.29561 application/octet-stream 65 bytes
perf_example_strace.81372 application/octet-stream 411.0 KB
perf_example_backtrace.81372 application/octet-stream 290 bytes
perf_example_stack.81372 application/octet-stream 280 bytes
perf_example_status.81372 application/octet-stream 918 bytes
perf_example_syscall.81372 application/octet-stream 83 bytes
perf_example_vacuum application/octet-stream 12.6 KB
