Re: max_connections reached in postgres 9.3.3

From: Borislav Ivanov <bivanov(at)atlassian(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Kevin Grittner <kgrittn(at)ymail(dot)com>, "Vasudevan, Ramya" <ramya(dot)vasudevan(at)classmates(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: max_connections reached in postgres 9.3.3
Date: 2014-06-19 22:12:39
Message-ID: CAEqRsYWcg5QdYwrffCAsZJ85BCcfSktr1MSEEnG9SysQ4VXB_w@mail.gmail.com
Lists: pgsql-general

Hi Ramya,

We are experiencing exactly the same problem here at Bitbucket. From what I
can tell, the major difference between your setup and ours is that you run
9.3.3 and we run 9.2.8. Our post about the issue is at
http://www.postgresql.org/message-id/CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com.
It contains the information we collected during the incident.

We have tried quite a few things, but none of them have solved the problem
yet. Things we tried:
- Upgraded CentOS 6.4 to 6.5
- Upgraded the kernel multiple times (currently 2.6.32-431.17.1)
- Changed the I/O scheduler from CFQ to deadline (front_merges = 0,
read_expire = 150, write_expire = 1500) on ext4 for both the PGDATA and
pg_xlog volumes
- Switched from ext4 to XFS for both PGDATA and pg_xlog (we thought we were
hitting the ext4 freeze bug)
- Upgraded Postgres from 9.2.5 to 9.2.6 and then to 9.2.8
- Lowered max_connections from 3000 to 500
- Lowered pgbouncer's default_pool_size
- Switched pgbouncer from session to transaction mode (it was suitable for
our use)
- Made multiple app improvements, which lowered the returned tuples from
10M to 1.5M
- Fixed all slow queries
- Truncated tables bigger than 25GB data, 25GB index
- Tried shared_buffers at 4GB, 6GB, and 8GB (currently 4GB)
- Tried work_mem at multiple values (currently 384MB)
- Tried wal_buffers at 4MB and 8MB (currently 8MB)
- Tried different approaches to checkpoints; none of them made a difference
for the actual problem
- We have always run with synchronous_commit = on
- We have 4 streaming replicas
- We archive from the master using rsync in archive_command
- Switched the master to a different machine to triple-confirm there were
no hardware issues
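For reference, the rsync-based archiving mentioned above is along these
lines (a sketch; the destination host and path are placeholders, not our
actual ones):

```
# postgresql.conf (sketch; archivehost and /pg_archive are hypothetical)
archive_mode = on
archive_command = 'rsync -a %p archivehost:/pg_archive/%f'
```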

Next we are adding a second pgbouncer layer on the database host itself.
Our goal is to pass fewer connections to Postgres than the number of cores
we have. Many people are suggesting that we lower our max_connections, and
this thread says the same, so we are going to try that.
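Concretely, the on-host pgbouncer layer would look roughly like this (a
sketch; the database name, ports, and the exact pool size are illustrative,
the point being a default_pool_size below the machine's core count):

```
; pgbouncer.ini on the database host (sketch; values are hypothetical)
[databases]
bitbucket = host=127.0.0.1 port=5432

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 12   ; keep server connections under the core count
max_client_conn = 2000
```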

However, most people on our team think that the number of connections is
purely a symptom of the actual problem. We would love to be wrong about
this, but for now we feel the high number of connections helps sustain the
problem rather than actually triggering it.

I am including some data from when the DB freezes at 400 connections, and
another data set from the DB slowing down under a load test with 500
connections, where performance is quite degraded but Postgres didn't hang.
I am also willing to provide more data from our side as needed.
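For anyone reproducing the numbers in this thread, the connection-state
breakdown comes from queries along these lines (a sketch; the state and
waiting columns exist in pg_stat_activity as of 9.2):

```
-- Backends by state (active, idle, idle in transaction, ...)
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;

-- Sessions blocked on a lock, and the ungranted locks themselves
SELECT pid, waiting, query FROM pg_stat_activity WHERE waiting;
SELECT locktype, relation, pid, mode FROM pg_locks WHERE NOT granted;
```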

# vmstat -S M 1 20 from when the DB freezes. The site is down and we are
maxed out on connections. This incident is the same as what you see. In
this case our Postgres cannot recover; the only fix we have so far is to
restart Postgres.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
  r  b   swpd   free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
 51  0      0 123717   152 121785    0    0     4    34      1      1 20  3 77  0  0
228  0      0 123702   152 121786    0    0     0   720 134874 295787 79 11 10  0  0
232  0      0 123700   152 121786    0    0     0  7128 138700 314263 79 12  9  0  0
274  0      0 123691   152 121786    0    0     0   980 133590 245257 81 11  8  0  0
380  0      0 123789   152 121786    0    0     0  1008 133792 258339 80 11  9  0  0
142  0      0 123839   152 121786    0    0     0  1328 139243 303489 78 12 10  0  0

# vmstat -S M 1 20 from when the DB is stressed with real production load.
The site is slow but up. Performance is degraded but the machine and
Postgres can recover from this situation. Note that the machine was freshly
rebooted before running this test.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
  r  b   swpd   free  buff  cache   si   so    bi    bo     in     cs us sy id wa st
156 10      0 232743    45  16749    0    0    42     2     80     80  3  0 96  1  0
260 11      0 232639    45  16828    0    0 76704  2792 155594 223928 83 12  4  0  0
128 12      0 232552    45  16887    0    0 59536  2896 153932 229723 82 12  6  1  0
267  6      0 232480    45  16931    0    0 43144  2320 149501 200510 83 11  5  0  0
105  8      0 232408    45  16965    0    0 38016  2208 148905 185142 84 11  5  0  0
112  1      0 232339    45  17000    0    0 32640  2528 148390 183620 83 11  6  0  0

Again, if you want to see more of the data we've collected during the
incident, see our post at
http://www.postgresql.org/message-id/CAJ+wzrb1qhz3xuoeSy5mo8i=E-5OO9Yvm6R+VxLBGaPB=uevqA@mail.gmail.com.

I'll also keep you updated on anything new we find, and on how lowering the
number of potential connections to Postgres goes. We have no plans to test
on CentOS 5.8.

On 19 June 2014 12:56, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Thu, Jun 19, 2014 at 2:35 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> > "Vasudevan, Ramya" <ramya(dot)vasudevan(at)classmates(dot)com> wrote:
> >
> >> On the waiting queries - When we reached 1500 connections, we
> >> had 759 connections that were in active state (116 COMMIT, 238
> >> INSERT, UPDATE 176, 57 AUTHENTICATION, 133 BIND. These active
> >> INSERTS and UPDATES also includes the 80-90 waiting sessions (We
> >> checked pg_stat_activity for 'waiting' state. And pg_locks for
> >> granted=f). The blocking and the waiting queries were simple one
> >> row updates/inserts/deletes. These shouldn’t be blocking each
> >> other normally (unless, we think, there was a problem writing to
> >> the disk). Correct me if I am wrong.
> >
> > You may want to consider this:
> >
> > http://wiki.postgresql.org/wiki/Number_Of_Database_Connections
> >
> >> [vmstat show up to 173111 context switches per second, with high
> >> cs rate corresponding to user CPU% between 64 and 82]
> >
> > That usually means there is high contention for spinlocks,
> > potentially with processes getting suspended while holding
> > spinlocks, making things worse.
>
> Completely agree on both diagnosis and proposed solution -- load
> profile (low iowait, high user%, high processes, high CS) is
> symptomatic of too many processes trying to do things at once. So
> there may be some application caused driver of the problem or you are
> hitting a contention point within postgres (a 'perf top' might give
> clues to the latter). Either way, once you are in this state you end
> up with too many processes fighting for cpu and cache lines which
> exacerbates the problem to the point you can classify it as an outage.
>
> Be advised transaction mode pooling makes certain features of the
> database difficult or impossible to use -- advisory locks (except xact
> variants), server side prepared statements, asynchronous
> notification, WITH HOLD cursors and the like -- basically anything
> scoped to the session. For many workloads it is a high win though.
> If for whatever reason this solution doesn't work, your other options
> are to try to optimize whatever is causing the load event
> (particularly if it's in your code -- careful query logging might give
> some clues) or to simply upgrade hardware (more/faster cpu especially
> for your case) to the point that even when highly loaded you're always
> clearing queries at an acceptable rate. The hardware approach has
> some risk though -- if you have a contention problem it's not always a
> given that adding cores will scale as well as you think. Faster
> core/bus is almost always a win, but obviously there's a very strict
> limit you can go.
>
> merlin
>
