From: Christopher Nielsen <cnielsen(at)atlassian(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Spurious Stalls
Date: 2014-06-12 19:57:24
Message-ID: CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com
Lists: pgsql-general
Hi Group,
My team has been very happy using Postgres to host Bitbucket
<http://bitbucket.org/>. Thanks very much for all the community
contributions to the platform.
Lately, though, for about a week now, we have been experiencing stalls
roughly once a day. When Postgres stalls, we have not been able to recover
without restarting the database, unfortunately.
This has brought our uptime down to 99.2%, which we'd like to avoid :( We'd
like to do a better job of keeping things running.
It would be great to get your input on this. Alternatively, if someone is
available as a consultant, that would be great too.
Here is some background on the issue. During these incidents, we have
observed the following symptoms (a sketch of a query that illustrates the
first two is included after the list):
- Running queries do not return.
- The application sometimes can no longer get new connections.
- CPU load increases.
- There is no I/O wait.
- There is no swapping.
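To illustrate what we mean, a query along these lines (a rough sketch, not
the exact check we used; column names such as "waiting" vary by Postgres
version) would normally show the stuck statements:

    -- Oldest still-active statements and whether they are waiting on a lock.
    SELECT pid, state, waiting,
           now() - query_start AS runtime,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY query_start
    LIMIT 20;

During a stall, the runtimes simply keep growing and never clear.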
Our database configuration is attached to this email as postgresql.conf for
reference, along with a profile of our hardware and tuning as
pg_db_profile.txt.
While the database was unavailable, we also collected a lot of data. Looking
through this info, a few things stand out as potentially problematic or
useful to notice:
- Disk I/O appears to be nearly all writes, with very little reading.
- In previous incidents with the same symptoms, we have seen Postgres
processes spending much of their time in s_lock.
- That info is also attached to this email, as the files named perf_*.
Additionally, monitoring graphs show the following performance profile.
*Problem*
As you can probably see in the graph below, at 11:54 the DB stops
returning rows.
Transactions also stop completing, causing the active transaction time to
climb sharply.
*Consequences of Problem*
Once transactions stop returning, we see connections pile up. Eventually we
reach the connection limit, and clients can no longer connect.
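A rough sketch of how this can be confirmed (again, not necessarily the
exact query we run):

    -- Current connection count versus the configured limit.
    SELECT count(*) AS connections,
           current_setting('max_connections') AS max_connections
    FROM pg_stat_activity;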
CPU utilization increases to nearly 100%, in user space, and stays there
until the database is restarted.
*Events Before Problem*
This is likely the most useful part. As the time approaches 11:54, there
are periods of increased latency, and there is also a marked general
increase in write operations.
Lastly, about 10 minutes before the outage, Postgres writes a sustained 30
MB/s of temp files.
After investigating this, we found a query that was greatly exceeding
work_mem. We've since optimized it, and hopefully that will have a positive
effect on the above.
We may not know until the next issue happens, though.
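For reference, one way the temp-file growth can be seen (a sketch only;
pg_stat_database has tracked temp-file counters since 9.2, and this is not
necessarily how we found the query):

    -- Cumulative temp-file usage per database since the last stats reset.
    SELECT datname, temp_files,
           pg_size_pretty(temp_bytes) AS temp_size
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;

Setting log_temp_files = 0 in postgresql.conf also logs every temp file
along with the statement that created it, which helps tie the writes back
to a specific query.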
With a problem like this, I am not entirely sure how to proceed. I am
really looking forward to hearing your thoughts and opinions, if you can
share them.
Thanks very much,
-Chris
Attachment | Content-Type | Size |
---|---|---|
pg_db_profile.txt | text/plain | 11.2 KB |
postgresql.conf | application/octet-stream | 2.7 KB |
perf_example_vmstat | application/octet-stream | 1.8 KB |
perf_example_dmesg | application/octet-stream | 66.4 KB |
perf_example_ipcs | application/octet-stream | 837 bytes |
perf_example_locks.csv | text/csv | 70 bytes |
perf_example_pginfo | application/octet-stream | 59 bytes |
perf_example_ps_auxfww | application/octet-stream | 146.7 KB |
perf_example_iotop | application/octet-stream | 11.0 KB |
perf_example_strace.47700 | application/octet-stream | 37 bytes |
perf_example_backtrace.47700 | application/octet-stream | 419 bytes |
perf_example_stack.47700 | application/octet-stream | 316 bytes |
perf_example_status.47700 | application/octet-stream | 926 bytes |
perf_example_strace.46462 | application/octet-stream | 2.9 MB |
perf_example_syscall.47700 | application/octet-stream | 65 bytes |
perf_example_backtrace.46462 | application/octet-stream | 1.7 KB |
perf_example_stack.46462 | application/octet-stream | 40 bytes |
perf_example_status.46462 | application/octet-stream | 925 bytes |
perf_example_strace.29561 | application/octet-stream | 5.1 MB |
perf_example_syscall.46462 | application/octet-stream | 8 bytes |
perf_example_backtrace.29561 | application/octet-stream | 419 bytes |
perf_example_stack.29561 | application/octet-stream | 316 bytes |
perf_example_status.29561 | application/octet-stream | 927 bytes |
perf_example_syscall.29561 | application/octet-stream | 65 bytes |
perf_example_strace.81372 | application/octet-stream | 411.0 KB |
perf_example_backtrace.81372 | application/octet-stream | 290 bytes |
perf_example_stack.81372 | application/octet-stream | 280 bytes |
perf_example_status.81372 | application/octet-stream | 918 bytes |
perf_example_syscall.81372 | application/octet-stream | 83 bytes |
perf_example_vacuum | application/octet-stream | 12.6 KB |