Re: max_connections reached in postgres 9.3.3

From: "Vasudevan, Ramya" <ramya(dot)vasudevan(at)classmates(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: max_connections reached in postgres 9.3.3
Date: 2014-06-18 22:08:11
Message-ID: 20EE50F73664E744AF948F0106FE6DFA2AD1101B@SEAMBX01.sea.corp.int.untd.com
Lists: pgsql-general

Merlin, thank you for the response.

On the waiting queries - when we reached 1500 connections, we had 759 connections in the active state (116 COMMIT, 238 INSERT, 176 UPDATE, 57 AUTHENTICATION, 133 BIND). These active INSERTs and UPDATEs also include the 80-90 waiting sessions (we checked pg_stat_activity for the 'waiting' state, and pg_locks for granted = f). The blocking and the waiting queries were simple single-row updates/inserts/deletes, which shouldn’t normally block each other (unless, we think, there was a problem writing to the disk). Correct me if I am wrong.
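
For reference, this is roughly the kind of check we ran to pair the waiting sessions with their blockers (just a sketch against 9.3's pg_locks and pg_stat_activity; a complete version would also match the virtualxid/classid/objid lock columns):

-- waiting (granted = f) lock requests joined to the granted locks on the same target
SELECT w.pid              AS waiting_pid,
       wa.query           AS waiting_query,
       b.pid              AS blocking_pid,
       ba.query           AS blocking_query
FROM   pg_locks w
JOIN   pg_locks b
  ON   b.granted AND NOT w.granted
 AND   b.pid <> w.pid
 AND   b.locktype = w.locktype
 AND   b.database      IS NOT DISTINCT FROM w.database
 AND   b.relation      IS NOT DISTINCT FROM w.relation
 AND   b.page          IS NOT DISTINCT FROM w.page
 AND   b.tuple         IS NOT DISTINCT FROM w.tuple
 AND   b.transactionid IS NOT DISTINCT FROM w.transactionid
JOIN   pg_stat_activity wa ON wa.pid = w.pid
JOIN   pg_stat_activity ba ON ba.pid = b.pid;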

1) log_min_duration_statement >>
We have it set to 200ms and are capturing all slow queries.
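
For reference, the threshold currently in effect (expected to be 200, with unit ms) can be confirmed with:

SELECT name, setting, unit
FROM   pg_settings
WHERE  name = 'log_min_duration_statement';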

2) detailed system performance diagnostics during load event. We need to know precise values for user, system, iowait >>
First occurrence of the 'max_connections reached' problem: it seems to have started at about 10:40.

At 10:42, we ran out of connections:
Jun 17 10:42:21 FATAL: remaining connection slots are reserved for non-replication superuser connections

At 10:48:
We dropped the disk cache (using http://www.evanhoffman.com/evan/2012/09/25/problems-with-postgresql-9-0-on-centos-6-3-with-a-2-6-32-series-kernel/ as a reference). However, that didn’t help.

vmstat information (with user cpu, system cpu, idle % and i/o wait):
-----------------------------------------------------------------------------------
At 10:40:19, user cpu started climbing, and cs (context switches) and in (interrupts) climbed as well. No real change in swap, bi (blocks in) or bo (blocks out).
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ ---timestamp---
r b swpd free buff cache si so bi bo in cs us sy id wa st
7 0 0 939888 145808 774588544 0 0 1524 772 21177 32761 26 6 67 1 0 2014-06-17 10:40:06 PDT
14 0 0 951436 145816 774589952 0 0 1128 1396 22584 42339 34 6 59 1 0 2014-06-17 10:40:07 PDT
12 1 0 916132 145816 774591360 0 0 2036 1956 23132 35945 27 6 65 2 0 2014-06-17 10:40:08 PDT
7 0 0 925444 145816 774593920 0 0 1604 3156 23722 78796 35 8 55 1 0 2014-06-17 10:40:09 PDT
15 2 0 946200 145816 774595840 0 0 1420 460 24051 56856 35 8 57 1 0 2014-06-17 10:40:10 PDT
6 2 0 979356 145816 774597376 0 0 1708 5016 23379 33239 31 6 61 1 0 2014-06-17 10:40:11 PDT
6 0 0 1005784 145816 774615616 0 0 1980 6792 23584 36025 30 6 62 2 0 2014-06-17 10:40:12 PDT
9 1 0 1004352 145816 774618880 0 0 1760 9784 24826 63949 35 7 57 2 0 2014-06-17 10:40:13 PDT
18 0 0 980156 145816 774621568 0 0 1260 9336 23897 69468 35 8 56 1 0 2014-06-17 10:40:14 PDT
4 1 0 960672 145816 774639168 0 0 1516 6420 23235 38023 33 7 58 1 0 2014-06-17 10:40:15 PDT
19 0 0 944560 145816 774642048 0 0 1880 6940 23337 39658 32 6 60 2 0 2014-06-17 10:40:16 PDT
16 1 0 897036 145816 774661440 0 0 2072 6024 26300 52597 39 8 53 1 0 2014-06-17 10:40:17 PDT
16 5 0 880428 145816 774664192 0 0 1460 12652 22617 28171 32 5 62 1 0 2014-06-17 10:40:18 PDT
11 0 0 885412 145816 774666112 0 0 1840 7884 25048 45393 39 7 53 1 0 2014-06-17 10:40:19 PDT
* interrupts (in) and context switches (cs) start climbing about here
24 1 0 900336 145816 774685120 0 0 2248 4804 26126 42707 42 7 50 1 0 2014-06-17 10:40:20 PDT
14 2 0 895120 145816 774687744 0 0 1856 22660 27721 57019 46 8 45 1 0 2014-06-17 10:40:21 PDT
27 0 0 881216 145816 774690496 0 0 1756 5872 27643 44114 43 7 48 2 0 2014-06-17 10:40:22 PDT
19 0 0 865320 145816 774692864 0 0 1704 25752 34434 132248 59 13 27 1 0 2014-06-17 10:40:23 PDT
8 0 0 868056 145816 774711680 0 0 1308 10960 30492 81550 52 10 37 2 0 2014-06-17 10:40:24 PDT
33 0 0 853888 145816 774713152 0 0 616 2960 31951 61918 68 10 21 0 0 2014-06-17 10:40:25 PDT
44 0 0 784172 145816 774714176 0 0 700 2292 40585 77163 77 15 8 0 0 2014-06-17 10:40:26 PDT
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ ---timestamp---
r b swpd free buff cache si so bi bo in cs us sy id wa st
32 0 0 613712 145816 774715008 0 0 768 1172 44675 108155 74 17 8 0 0 2014-06-17 10:40:27 PDT
68 0 0 564336 145764 774600320 0 0 960 4892 46838 145268 68 21 11 0 0 2014-06-17 10:40:28 PDT
49 1 0 565880 145732 774426560 0 0 1340 4112 47306 165764 68 21 11 0 0 2014-06-17 10:40:29 PDT
3 0 0 552808 145716 774312896 0 0 2072 1336 46749 160141 66 23 11 1 0 2014-06-17 10:40:30 PDT
26 2 0 549716 145704 774139328 0 0 1260 1212 47234 164933 67 22 10 0 0 2014-06-17 10:40:31 PDT
13 1 0 579216 145676 773907072 0 0 1116 1328 46831 173111 64 24 12 0 0 2014-06-17 10:40:32 PDT
44 0 0 563372 145640 773703680 0 0 1384 3120 49568 166285 68 23 9 0 0 2014-06-17 10:40:33 PDT
55 0 0 545040 145624 773587776 0 0 1056 3864 50042 167987 68 23 9 0 0 2014-06-17 10:40:34 PDT
86 2 0 572432 145572 773430144 0 0 944 1084 58868 140190 77 18 4 0 0 2014-06-17 10:40:35 PDT
63 0 0 555424 145548 773372864 0 0 312 616 61136 126323 82 16 3 0 0 2014-06-17 10:40:36 PDT
71 0 0 584556 145500 773255936 0 0 888 852 59944 128606 81 16 3 0 0 2014-06-17 10:40:37 PDT
45 1 0 575360 145500 773140224 0 0 796 1844 60107 128723 81 16 3 0 0 2014-06-17 10:40:38 PDT
37 0 0 560712 145472 772965120 0 0 1440 4796 57150 140963 76 19 5 0 0 2014-06-17 10:40:39 PDT
112 0 0 541944 145424 772732480 0 0 2340 1272 53304 154266 70 22 7 0 0 2014-06-17 10:40:40 PDT
55 1 0 548428 145368 772440000 0 0 852 816 51299 166797 71 22 7 0 0 2014-06-17 10:40:41 PDT
21 0 0 584668 145288 772177216 0 0 1232 992 49328 157423 70 23 7 0 0 2014-06-17 10:40:42 PDT
61 0 0 552688 145236 771940608 0 0 2224 5216 49301 165526 66 25 9 0 0 2014-06-17 10:40:43 PDT
29 1 0 564316 145216 771679104 0 0 1340 4832 48844 164812 68 24 8 0 0 2014-06-17 10:40:44 PDT
34 0 0 566816 145192 771444288 0 0 1156 776 49673 155265 71 22 7 0 0 2014-06-17 10:40:45 PDT
34 0 0 569152 145176 771268608 0 0 1232 33408 49762 138163 74 20 6 0 0 2014-06-17 10:40:46 PDT
32 1 0 582320 145144 771034752 0 0 712 596 47012 156240 69 22 9 0 0 2014-06-17 10:40:47 PDT

iostat information, consolidated from the output of iostat -x into one-minute intervals:
--------------------------------------------------------------------------------------------------
RPS: Reads in blocks per second
WPS: Writes in blocks per second
BLK_R: Blocks read
BLK_W: Blocks written

This is our file system mount point information:
site and site2 - mounts holding the data files.
data - pg_xlog
archive - archive logs

In general, we do not see anything that stands out in terms of I/O leading up to the problem.

i/o information on all mount points:
site - RPS went up after the cache was cleared, which is expected. But there is no pattern that I can see leading up to the issue:
20140617 10:36: site: RPS: 1433.91 WPS: 599.67 BLK_R: 1434.27 BLK_W: 1.08
20140617 10:37: site: RPS: 1364.76 WPS: 583.80 BLK_R: 1364.87 BLK_W: 0.88
20140617 10:38: site: RPS: 1392.00 WPS: 1412.67 BLK_R: 1392.07 BLK_W: 0.75
20140617 10:39: site: RPS: 1481.12 WPS: 664.47 BLK_R: 1481.20 BLK_W: 0.88
20140617 10:40: site: RPS: 1242.64 WPS: 2034.67 BLK_R: 1242.73 BLK_W: 1.08 (problem)
20140617 10:41: site: RPS: 758.62 WPS: 620.86 BLK_R: 758.73 BLK_W: 1.10
20140617 10:42: site: RPS: 1482.59 WPS: 1304.47 BLK_R: 1482.80 BLK_W: 0.80
20140617 10:43: site: RPS: 1699.67 WPS: 3165.47 BLK_R: 1699.67 BLK_W: 1.13
20140617 10:44: site: RPS: 1072.45 WPS: 1193.67 BLK_R: 1072.67 BLK_W: 1.05
20140617 10:45: site: RPS: 941.51 WPS: 594.07 BLK_R: 941.67 BLK_W: 0.82
20140617 10:46: site: RPS: 622.42 WPS: 358.07 BLK_R: 622.73 BLK_W: 0.82
20140617 10:47: site: RPS: 781.18 WPS: 2038.07 BLK_R: 781.13 BLK_W: 0.98
20140617 10:48: site: RPS: 1582.35 WPS: 730.27 BLK_R: 1582.40 BLK_W: 0.93 (cache cleared)
20140617 10:49: site: RPS: 7954.96 WPS: 1713.60 BLK_R: 7955.87 BLK_W: 1.10
20140617 10:50: site: RPS: 10837.16 WPS: 1065.13 BLK_R: 10839.73 BLK_W: 1.05
20140617 10:51: site: RPS: 27859.84 WPS: 846.22 BLK_R: 27863.53 BLK_W: 1.02
20140617 10:52: site: RPS: 23309.31 WPS: 1650.86 BLK_R: 23312.67 BLK_W: 1.37
20140617 10:53: site: RPS: 28450.93 WPS: 220.83 BLK_R: 28454.00 BLK_W: 0.78
20140617 10:54: site: RPS: 14312.11 WPS: 272.51 BLK_R: 14313.60 BLK_W: 1.18
20140617 10:55: site: RPS: 29515.76 WPS: 157.33 BLK_R: 29516.00 BLK_W: 0.58
20140617 10:56: site: RPS: 22423.51 WPS: 683.07 BLK_R: 22426.80 BLK_W: 0.73

site2 - WPS went down after the problem, and even further down after the cache was cleared.
This is definitely an effect, not a cause: the DB was hung, so there were fewer writes.
Again, no pattern that I can see leading up to the issue:

20140617 10:36: site2: RPS: 1199.19 WPS: 1366.67 BLK_R: 1199.47 BLK_W: 1.17
20140617 10:37: site2: RPS: 1241.60 WPS: 1450.93 BLK_R: 1241.60 BLK_W: 1.27
20140617 10:38: site2: RPS: 1277.89 WPS: 1306.53 BLK_R: 1278.13 BLK_W: 1.20
20140617 10:39: site2: RPS: 1254.00 WPS: 1174.40 BLK_R: 1254.00 BLK_W: 1.07
20140617 10:40: site2: RPS: 1106.41 WPS: 1206.39 BLK_R: 1106.53 BLK_W: 0.87 (problem)
20140617 10:41: site2: RPS: 360.63 WPS: 898.53 BLK_R: 360.67 BLK_W: 0.87
20140617 10:42: site2: RPS: 1290.44 WPS: 418.27 BLK_R: 1290.67 BLK_W: 0.93
20140617 10:43: site2: RPS: 1804.53 WPS: 1614.93 BLK_R: 1804.53 BLK_W: 1.27
20140617 10:44: site2: RPS: 815.21 WPS: 1029.73 BLK_R: 815.47 BLK_W: 0.73
20140617 10:45: site2: RPS: 731.70 WPS: 640.67 BLK_R: 732.00 BLK_W: 1.33
20140617 10:46: site2: RPS: 415.36 WPS: 362.80 BLK_R: 415.47 BLK_W: 1.13
20140617 10:47: site2: RPS: 461.93 WPS: 268.13 BLK_R: 462.27 BLK_W: 1.13
20140617 10:48: site2: RPS: 559.47 WPS: 175.07 BLK_R: 559.47 BLK_W: 1.07
20140617 10:49: site2: RPS: 578.12 WPS: 306.53 BLK_R: 578.27 BLK_W: 1.20
20140617 10:50: site2: RPS: 419.14 WPS: 155.60 BLK_R: 419.20 BLK_W: 1.10
20140617 10:51: site2: RPS: 407.66 WPS: 152.80 BLK_R: 407.73 BLK_W: 1.13
20140617 10:52: site2: RPS: 316.07 WPS: 204.62 BLK_R: 316.00 BLK_W: 1.07
20140617 10:53: site2: RPS: 4957.84 WPS: 122.80 BLK_R: 4958.00 BLK_W: 1.13
20140617 10:54: site2: RPS: 5731.85 WPS: 236.40 BLK_R: 5732.40 BLK_W: 0.90
20140617 10:55: site2: RPS: 69626.67 WPS: 0.67 BLK_R: 69626.67 BLK_W: 0.17 (cache cleared)
20140617 10:56: site2: RPS: 77967.18 WPS: 2.93 BLK_R: 77980.80 BLK_W: 0.27
20140617 10:57: site2: RPS: 92866.05 WPS: 12.79 BLK_R: 92880.93 BLK_W: 3.13
20140617 10:58: site2: RPS: 84538.00 WPS: 22.53 BLK_R: 84538.00 BLK_W: 4.20
20140617 10:59: site2: RPS: 37730.09 WPS: 72.26 BLK_R: 37737.33 BLK_W: 2.93
20140617 11:00: site2: RPS: 82512.67 WPS: 13.60 BLK_R: 82512.67 BLK_W: 2.27
20140617 11:01: site2: RPS: 78838.10 WPS: 23.33 BLK_R: 78850.67 BLK_W: 4.17

data - A minute after the issue, there is a spurt in reads (could be a one-off).
Writes drop (an effect, not a cause).
After the cache is cleared, reads from disk increase and writes decrease (expected).

20140617 10:11: data: RPS: 0.00 WPS: 1342.90 BLK_R: 0.00 BLK_W: 3.73
20140617 10:12: data: RPS: 0.00 WPS: 1187.73 BLK_R: 0.00 BLK_W: 4.57
20140617 10:13: data: RPS: 0.00 WPS: 1318.91 BLK_R: 0.00 BLK_W: 4.23
20140617 10:14: data: RPS: 0.00 WPS: 1350.00 BLK_R: 0.00 BLK_W: 4.17
20140617 10:15: data: RPS: 0.00 WPS: 1539.12 BLK_R: 0.00 BLK_W: 3.83
20140617 10:16: data: RPS: 0.00 WPS: 1255.47 BLK_R: 0.00 BLK_W: 4.47
20140617 10:17: data: RPS: 0.00 WPS: 1734.58 BLK_R: 0.00 BLK_W: 4.20
20140617 10:18: data: RPS: 0.00 WPS: 2082.13 BLK_R: 0.00 BLK_W: 4.13
20140617 10:19: data: RPS: 0.00 WPS: 1560.90 BLK_R: 0.00 BLK_W: 4.70
20140617 10:20: data: RPS: 0.00 WPS: 1413.73 BLK_R: 0.00 BLK_W: 3.57
20140617 10:21: data: RPS: 0.00 WPS: 1677.33 BLK_R: 0.00 BLK_W: 3.90
20140617 10:22: data: RPS: 0.27 WPS: 1454.13 BLK_R: 0.27 BLK_W: 3.97
20140617 10:23: data: RPS: 0.00 WPS: 1439.16 BLK_R: 0.00 BLK_W: 4.00
20140617 10:24: data: RPS: 0.00 WPS: 1323.47 BLK_R: 0.00 BLK_W: 4.13
20140617 10:25: data: RPS: 0.00 WPS: 1650.56 BLK_R: 0.00 BLK_W: 3.27
20140617 10:26: data: RPS: 0.00 WPS: 1361.20 BLK_R: 0.00 BLK_W: 3.87
20140617 10:27: data: RPS: 0.00 WPS: 1360.07 BLK_R: 0.00 BLK_W: 3.93
20140617 10:28: data: RPS: 0.00 WPS: 1260.13 BLK_R: 0.00 BLK_W: 3.97
20140617 10:29: data: RPS: 0.00 WPS: 1513.86 BLK_R: 0.00 BLK_W: 3.50
20140617 10:30: data: RPS: 0.00 WPS: 1499.73 BLK_R: 0.00 BLK_W: 3.90
20140617 10:31: data: RPS: 0.00 WPS: 1394.80 BLK_R: 0.00 BLK_W: 3.47
20140617 10:32: data: RPS: 0.00 WPS: 1690.27 BLK_R: 0.00 BLK_W: 3.90
20140617 10:33: data: RPS: 0.00 WPS: 1543.55 BLK_R: 0.00 BLK_W: 3.57
20140617 10:34: data: RPS: 0.00 WPS: 1457.47 BLK_R: 0.00 BLK_W: 3.80
20140617 10:35: data: RPS: 0.00 WPS: 1414.00 BLK_R: 0.00 BLK_W: 4.50
20140617 10:36: data: RPS: 0.00 WPS: 1557.14 BLK_R: 0.00 BLK_W: 4.23
20140617 10:37: data: RPS: 0.00 WPS: 1675.60 BLK_R: 0.00 BLK_W: 3.77
20140617 10:38: data: RPS: 0.00 WPS: 1473.56 BLK_R: 0.00 BLK_W: 4.57
20140617 10:39: data: RPS: 0.00 WPS: 1424.00 BLK_R: 0.00 BLK_W: 3.67
20140617 10:40: data: RPS: 0.00 WPS: 4113.20 BLK_R: 0.00 BLK_W: 4.13 (problem)

* (anomaly, RPS goes to 2.40 from 0.0. One-off?) *
20140617 10:41: data: RPS: 2.40 WPS: 425.04 BLK_R: 2.40 BLK_W: 3.90
20140617 10:42: data: RPS: 0.00 WPS: 1402.49 BLK_R: 0.00 BLK_W: 4.50
20140617 10:43: data: RPS: 0.00 WPS: 1619.47 BLK_R: 0.00 BLK_W: 3.47
20140617 10:44: data: RPS: 0.00 WPS: 715.64 BLK_R: 0.00 BLK_W: 3.53
20140617 10:45: data: RPS: 0.27 WPS: 547.02 BLK_R: 0.27 BLK_W: 3.53
20140617 10:46: data: RPS: 0.00 WPS: 163.31 BLK_R: 0.00 BLK_W: 4.00
20140617 10:47: data: RPS: 0.13 WPS: 815.87 BLK_R: 0.13 BLK_W: 4.30
20140617 10:48: data: RPS: 74.00 WPS: 663.07 BLK_R: 74.00 BLK_W: 3.90 (cache cleared)
20140617 10:49: data: RPS: 448.53 WPS: 628.63 BLK_R: 448.53 BLK_W: 4.33
20140617 10:50: data: RPS: 39.20 WPS: 208.50 BLK_R: 39.20 BLK_W: 4.03
20140617 10:51: data: RPS: 35.73 WPS: 228.12 BLK_R: 35.73 BLK_W: 4.13
20140617 10:52: data: RPS: 15.20 WPS: 148.78 BLK_R: 15.20 BLK_W: 3.63
20140617 10:53: data: RPS: 24.27 WPS: 162.37 BLK_R: 24.27 BLK_W: 3.40
20140617 10:54: data: RPS: 136.26 WPS: 139.74 BLK_R: 136.27 BLK_W: 1.90
20140617 10:55: data: RPS: 93.33 WPS: 2253.87 BLK_R: 93.33 BLK_W: 4.37
20140617 10:56: data: RPS: 20.27 WPS: 837.60 BLK_R: 20.27 BLK_W: 4.57
20140617 10:57: data: RPS: 11.46 WPS: 391.42 BLK_R: 11.47 BLK_W: 4.30
20140617 10:58: data: RPS: 21.07 WPS: 348.27 BLK_R: 21.07 BLK_W: 3.77
20140617 10:59: data: RPS: 65.83 WPS: 388.87 BLK_R: 65.87 BLK_W: 3.43
20140617 11:00: data: RPS: 45.87 WPS: 1159.20 BLK_R: 45.87 BLK_W: 3.37

archive - After the problem started, WPS dropped very low for some time.
Is that expected? Is it because WAL logs were written at a lower pace, and so archiving had less to do?
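
One way we could verify this next time would be to sample the WAL insert position periodically and diff consecutive samples. A sketch using the 9.3 function names (the LSN literals below are made-up placeholders):

-- log this every minute alongside the iostat samples
SELECT now() AS sample_time,
       pg_current_xlog_location() AS xlog_location;

-- then compute the bytes of WAL generated between two sampled locations, e.g.:
SELECT pg_xlog_location_diff('2A/54000000', '2A/50000000') AS wal_bytes;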

20140617 10:34: archive: RPS: 0.00 WPS: 1648.27 BLK_R: 0.00 BLK_W: 0.93
20140617 10:35: archive: RPS: 0.00 WPS: 553.07 BLK_R: 0.00 BLK_W: 0.57
20140617 10:36: archive: RPS: 0.00 WPS: 1648.10 BLK_R: 0.00 BLK_W: 1.30
20140617 10:37: archive: RPS: 0.00 WPS: 1102.00 BLK_R: 0.00 BLK_W: 1.17
20140617 10:38: archive: RPS: 0.00 WPS: 1098.39 BLK_R: 0.00 BLK_W: 0.73
20140617 10:39: archive: RPS: 0.00 WPS: 1102.27 BLK_R: 0.00 BLK_W: 0.60
20140617 10:40: archive: RPS: 0.00 WPS: 3285.07 BLK_R: 0.00 BLK_W: 0.73 (problem)
20140617 10:41: archive: RPS: 0.00 WPS: 1102.27 BLK_R: 0.00 BLK_W: 1.10
20140617 10:42: archive: RPS: 0.00 WPS: 551.73 BLK_R: 0.00 BLK_W: 0.73
20140617 10:43: archive: RPS: 0.00 WPS: 1101.07 BLK_R: 0.00 BLK_W: 1.07
20140617 10:44: archive: RPS: 0.00 WPS: 1103.47 BLK_R: 0.00 BLK_W: 1.13
20140617 10:45: archive: RPS: 0.00 WPS: 553.20 BLK_R: 0.00 BLK_W: 0.70
20140617 10:46: archive: RPS: 0.00 WPS: 7.07 BLK_R: 0.00 BLK_W: 0.73
20140617 10:47: archive: RPS: 0.00 WPS: 553.84 BLK_R: 0.00 BLK_W: 0.87
20140617 10:48: archive: RPS: 0.00 WPS: 554.53 BLK_R: 0.00 BLK_W: 0.87 (cache cleared)
20140617 10:49: archive: RPS: 0.00 WPS: 551.31 BLK_R: 0.00 BLK_W: 0.50
20140617 10:50: archive: RPS: 0.00 WPS: 8.53 BLK_R: 0.00 BLK_W: 0.87
20140617 10:51: archive: RPS: 0.00 WPS: 553.73 BLK_R: 0.00 BLK_W: 0.73
20140617 10:52: archive: RPS: 0.00 WPS: 3.87 BLK_R: 0.00 BLK_W: 0.87
20140617 10:53: archive: RPS: 0.00 WPS: 6.67 BLK_R: 0.00 BLK_W: 0.83
20140617 10:54: archive: RPS: 0.00 WPS: 554.27 BLK_R: 0.00 BLK_W: 0.77
20140617 10:55: archive: RPS: 848.80 WPS: 555.87 BLK_R: 848.80 BLK_W: 1.03
20140617 10:56: archive: RPS: 0.27 WPS: 1099.20 BLK_R: 0.27 BLK_W: 0.87
20140617 10:57: archive: RPS: 0.00 WPS: 1.33 BLK_R: 0.00 BLK_W: 0.17
20140617 10:58: archive: RPS: 0.00 WPS: 1.33 BLK_R: 0.00 BLK_W: 0.17
20140617 10:59: archive: RPS: 0.00 WPS: 1110.13 BLK_R: 0.00 BLK_W: 0.13
20140617 11:00: archive: RPS: 0.00 WPS: 553.87 BLK_R: 0.00 BLK_W: 0.37
20140617 11:01: archive: RPS: 0.00 WPS: 1102.13 BLK_R: 0.00 BLK_W: 0.80
20140617 11:02: archive: RPS: 0.00 WPS: 1647.87 BLK_R: 0.00 BLK_W: 1.03
20140617 11:03: archive: RPS: 0.00 WPS: 2191.60 BLK_R: 0.00 BLK_W: 0.60
20140617 11:04: archive: RPS: 0.00 WPS: 2194.27 BLK_R: 0.00 BLK_W: 0.93
20140617 11:05: archive: RPS: 0.00 WPS: 2193.33 BLK_R: 0.00 BLK_W: 1.33
20140617 11:06: archive: RPS: 0.00 WPS: 7.84 BLK_R: 0.00 BLK_W: 0.70
20140617 11:07: archive: RPS: 3229.33 WPS: 4.27 BLK_R: 3229.33 BLK_W: 0.43
20140617 11:08: archive: RPS: 3281.07 WPS: 6.40 BLK_R: 3281.07 BLK_W: 0.73
20140617 11:09: archive: RPS: 19427.33 WPS: 554.27 BLK_R: 19427.33 BLK_W: 0.77
20140617 11:10: archive: RPS: 0.00 WPS: 2191.47 BLK_R: 0.00 BLK_W: 0.63
20140617 11:11: archive: RPS: 0.00 WPS: 1102.27 BLK_R: 0.00 BLK_W: 0.60
20140617 11:12: archive: RPS: 0.00 WPS: 1102.00 BLK_R: 0.00 BLK_W: 0.67
20140617 11:13: archive: RPS: 0.00 WPS: 1101.60 BLK_R: 0.00 BLK_W: 0.77
20140617 11:14: archive: RPS: 0.00 WPS: 3829.47 BLK_R: 0.00 BLK_W: 0.80
20140617 11:15: archive: RPS: 0.00 WPS: 1104.13 BLK_R: 0.00 BLK_W: 0.97
20140617 11:16: archive: RPS: 0.00 WPS: 1101.87 BLK_R: 0.00 BLK_W: 0.87
20140617 11:17: archive: RPS: 1339.07 WPS: 1100.40 BLK_R: 1339.07 BLK_W: 0.90
20140617 11:18: archive: RPS: 0.00 WPS: 1100.67 BLK_R: 0.00 BLK_W: 0.83

Load average information leading up to the problem:
-----------------------------------------------------------------
20140617 10:30:01 top - 10:30:08 up 161 days, 19:46, 5 users, load average: 1.30, 1.22, 1.20
20140617 10:30:01 Tasks: 1347 total, 15 running, 1332 sleeping, 0 stopped, 0 zombie
--
20140617 10:31:01 top - 10:31:08 up 161 days, 19:47, 5 users, load average: 1.89, 1.35, 1.25
20140617 10:31:01 Tasks: 1339 total, 6 running, 1333 sleeping, 0 stopped, 0 zombie
--
20140617 10:32:01 top - 10:32:09 up 161 days, 19:48, 5 users, load average: 1.53, 1.33, 1.25
20140617 10:32:01 Tasks: 1359 total, 5 running, 1354 sleeping, 0 stopped, 0 zombie
--
20140617 10:33:01 top - 10:33:08 up 161 days, 19:49, 5 users, load average: 1.10, 1.25, 1.22
20140617 10:33:01 Tasks: 1333 total, 10 running, 1323 sleeping, 0 stopped, 0 zombie
--
20140617 10:34:01 top - 10:34:08 up 161 days, 19:50, 5 users, load average: 1.01, 1.23, 1.22
20140617 10:34:01 Tasks: 1368 total, 8 running, 1360 sleeping, 0 stopped, 0 zombie
--
20140617 10:35:01 top - 10:35:09 up 161 days, 19:51, 5 users, load average: 1.21, 1.26, 1.22
20140617 10:35:01 Tasks: 1359 total, 6 running, 1353 sleeping, 0 stopped, 0 zombie
--
20140617 10:36:01 top - 10:36:08 up 161 days, 19:52, 5 users, load average: 1.19, 1.25, 1.22
20140617 10:36:01 Tasks: 1356 total, 9 running, 1347 sleeping, 0 stopped, 0 zombie
--
20140617 10:37:01 top - 10:37:08 up 161 days, 19:53, 5 users, load average: 0.88, 1.15, 1.19
20140617 10:37:01 Tasks: 1339 total, 13 running, 1326 sleeping, 0 stopped, 0 zombie
--
20140617 10:38:01 top - 10:38:08 up 161 days, 19:54, 5 users, load average: 1.51, 1.24, 1.21
20140617 10:38:01 Tasks: 1356 total, 9 running, 1347 sleeping, 0 stopped, 0 zombie
--
20140617 10:39:01 top - 10:39:08 up 161 days, 19:55, 5 users, load average: 1.75, 1.35, 1.25
20140617 10:39:01 Tasks: 1339 total, 12 running, 1327 sleeping, 0 stopped, 0 zombie
--
20140617 10:40:01 top - 10:40:08 up 161 days, 19:56, 5 users, load average: 1.27, 1.27, 1.23
20140617 10:40:01 Tasks: 1359 total, 4 running, 1355 sleeping, 0 stopped, 0 zombie

* Start of problem *
--
20140617 10:41:02 top - 10:41:12 up 161 days, 19:57, 5 users, load average: 14.96, 4.47, 2.29
20140617 10:41:02 Tasks: 1831 total, 59 running, 1772 sleeping, 0 stopped, 0 zombie
--
20140617 10:43:01 top - 10:43:09 up 161 days, 19:59, 5 users, load average: 16.23, 10.94, 5.07
20140617 10:43:01 Tasks: 2203 total, 49 running, 2154 sleeping, 0 stopped, 0 zombie
--
20140617 10:44:01 top - 10:44:09 up 161 days, 20:00, 5 users, load average: 8.61, 9.60, 4.98
20140617 10:44:01 Tasks: 1942 total, 36 running, 1906 sleeping, 0 stopped, 0 zombie
--
20140617 10:45:01 top - 10:45:10 up 161 days, 20:01, 5 users, load average: 26.00, 14.28, 6.86
20140617 10:45:01 Tasks: 2226 total, 49 running, 2177 sleeping, 0 stopped, 0 zombie
--
20140617 10:46:02 top - 10:46:22 up 161 days, 20:02, 6 users, load average: 121.92, 47.56, 19.11
20140617 10:46:02 Tasks: 2248 total, 102 running, 2145 sleeping, 0 stopped, 1 zombie
--
20140617 10:47:01 top - 10:47:15 up 161 days, 20:03, 7 users, load average: 119.38, 58.95, 24.52
20140617 10:47:01 Tasks: 2257 total, 83 running, 2171 sleeping, 0 stopped, 2 zombie
--
20140617 10:48:01 top - 10:48:12 up 161 days, 20:04, 9 users, load average: 78.64, 57.52, 26.14
20140617 10:48:01 Tasks: 2247 total, 95 running, 2152 sleeping, 0 stopped, 0 zombie
--
20140617 10:55:01 top - 10:55:08 up 161 days, 20:11, 13 users, load average: 55.85, 64.71, 41.02
20140617 10:55:01 Tasks: 1275 total, 4 running, 1271 sleeping, 0 stopped, 0 zombie
--
20140617 10:56:01 top - 10:56:08 up 161 days, 20:12, 13 users, load average: 306.87, 132.30, 65.5
20140617 10:56:01 Tasks: 2007 total, 1 running, 2006 sleeping, 0 stopped, 0 zombie
--
20140617 10:57:01 top - 10:57:09 up 161 days, 20:13, 13 users, load average: 591.73, 241.81, 107.
20140617 10:57:01 Tasks: 2209 total, 2 running, 2207 sleeping, 0 stopped, 0 zombie
--
20140617 10:58:01 top - 10:58:08 up 161 days, 20:14, 13 users, load average: 858.24, 378.84, 162.
20140617 10:58:01 Tasks: 2266 total, 3 running, 2263 sleeping, 0 stopped, 0 zombie
--
20140617 10:59:02 top - 10:59:09 up 161 days, 20:15, 13 users, load average: 1068.68, 525.91, 226
20140617 10:59:02 Tasks: 2263 total, 1 running, 2262 sleeping, 0 stopped, 0 zombie
--
20140617 11:00:01 top - 11:00:08 up 161 days, 20:16, 13 users, load average: 1104.86, 639.53, 284
20140617 11:00:01 Tasks: 2279 total, 2 running, 2277 sleeping, 0 stopped, 0 zombie
--
20140617 11:01:01 top - 11:01:08 up 161 days, 20:17, 13 users, load average: 664.81, 602.11, 294.
20140617 11:01:01 Tasks: 2275 total, 6 running, 2269 sleeping, 0 stopped, 0 zombie
--
20140617 11:02:01 top - 11:02:09 up 161 days, 20:18, 13 users, load average: 485.78, 561.31, 299.
20140617 11:02:01 Tasks: 2272 total, 9 running, 2263 sleeping, 0 stopped, 0 zombie
--
20140617 11:03:01 top - 11:03:08 up 161 days, 20:19, 13 users, load average: 422.34, 529.23, 304.
20140617 11:03:01 Tasks: 2118 total, 4 running, 2114 sleeping, 0 stopped, 0 zombie
--
20140617 11:04:01 top - 11:04:09 up 161 days, 20:20, 13 users, load average: 371.39, 495.59, 307.
20140617 11:04:01 Tasks: 1953 total, 10 running, 1943 sleeping, 0 stopped, 0 zombie
--
20140617 11:05:01 top - 11:05:08 up 161 days, 20:21, 13 users, load average: 331.89, 461.73, 307.
20140617 11:05:01 Tasks: 1844 total, 8 running, 1836 sleeping, 0 stopped, 0 zombie
--
20140617 11:06:01 top - 11:06:12 up 161 days, 20:22, 13 users, load average: 213.18, 403.72, 297.
20140617 11:06:01 Tasks: 1913 total, 110 running, 1803 sleeping, 0 stopped, 0 zombie
--
20140617 11:10:01 top - 11:10:08 up 161 days, 20:26, 14 users, load average: 90.95, 229.57, 249.9
20140617 11:10:01 Tasks: 1326 total, 11 running, 1314 sleeping, 1 stopped, 0 zombie
--
20140617 11:11:01 top - 11:11:08 up 161 days, 20:27, 14 users, load average: 83.22, 202.67, 239.4
20140617 11:11:01 Tasks: 1544 total, 6 running, 1537 sleeping, 1 stopped, 0 zombie
--
20140617 11:12:01 top - 11:12:08 up 161 days, 20:28, 14 users, load average: 76.15, 179.03, 229.0
20140617 11:12:01 Tasks: 1486 total, 12 running, 1473 sleeping, 1 stopped, 0 zombie
--
20140617 11:13:01 top - 11:13:08 up 161 days, 20:29, 14 users, load average: 70.00, 158.61, 218.8
20140617 11:13:01 Tasks: 1518 total, 3 running, 1514 sleeping, 1 stopped, 0 zombie
--
20140617 11:14:01 top - 11:14:08 up 161 days, 20:30, 14 users, load average: 67.06, 141.63, 209.2
20140617 11:14:01 Tasks: 1439 total, 39 running, 1399 sleeping, 1 stopped, 0 zombie
--
20140617 11:15:01 top - 11:15:08 up 161 days, 20:31, 14 users, load average: 72.38, 128.89, 200.6
20140617 11:15:01 Tasks: 2279 total, 9 running, 2269 sleeping, 1 stopped, 0 zombie
--
20140617 11:16:01 top - 11:16:08 up 161 days, 20:32, 14 users, load average: 63.85, 116.17, 191.7
20140617 11:16:01 Tasks: 2002 total, 18 running, 1983 sleeping, 1 stopped, 0 zombie
--
20140617 11:17:01 top - 11:17:08 up 161 days, 20:33, 14 users, load average: 54.34, 103.79, 182.7
20140617 11:17:01 Tasks: 1759 total, 12 running, 1746 sleeping, 1 stopped, 0 zombie
--
20140617 11:18:01 top - 11:18:08 up 161 days, 20:34, 14 users, load average: 56.49, 95.20, 174.84
20140617 11:18:01 Tasks: 1630 total, 13 running, 1616 sleeping, 1 stopped, 0 zombie
--
20140617 11:19:01 top - 11:19:08 up 161 days, 20:35, 15 users, load average: 53.37, 87.20, 167.10
20140617 11:19:01 Tasks: 1501 total, 11 running, 1489 sleeping, 1 stopped, 0 zombie
--
20140617 11:20:01 top - 11:20:08 up 161 days, 20:36, 15 users, load average: 47.54, 79.38, 159.41
20140617 11:20:01 Tasks: 1415 total, 8 running, 1406 sleeping, 1 stopped, 0 zombie
--

3) know if transaction rates spiking going into (as opposed to during) the load is interesting. We can capture tps rates by logging the output of: select * from pg_stat_database. >>
We noticed that the tps rate was not abnormal before, during, or immediately after the issue happened. Old sessions were still "processing" and new sessions kept coming in as normal. When legitimate connections from our site came in while the existing old connections were still running in the database (in either an active/processing or active/waiting state), we hit the max_connections limit.
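
For context, deriving tps from that output amounts to differencing the cumulative xact_commit + xact_rollback counters between samples and dividing by the sample interval. A minimal sketch of such a snapshot (the WHERE clause assumes we only care about the current database):

SELECT now()          AS sample_time,
       datname,
       xact_commit,
       xact_rollback
FROM   pg_stat_database
WHERE  datname = current_database();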

A few more pieces of information to add:
This is what we see happening almost all the time when we have the issue:
- Starting at least 5 minutes before the issue (max_connections reached) happens, we see queries slowing down considerably (simple selects using indexes taking 10-15 minutes, and fewer DMLs)
- When the issue happens, user cpu maxes out, i/o wait drops to almost zero, and load goes really high. Simple selects take forever, and a ton of DMLs are left waiting (a per-state session snapshot we could log for this is sketched below).
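
A snapshot query along these lines (a sketch; 9.3 still has the boolean waiting column in pg_stat_activity) is what we could log every minute to catch the build-up:

SELECT state,
       waiting,
       count(*) AS sessions
FROM   pg_stat_activity
GROUP  BY state, waiting
ORDER  BY sessions DESC;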

Things we changed in the environment so far:
1. Disabled THP and TCP offloading (GRO/GSO/LRO/TSO). These helped reduce the frequency of the issue but didn’t fix it.
2. We restarted postgres with "numactl --interleave=all" last Thursday (06-12-14) to alleviate NUMA-specific memory issues.
3. We have also dropped the disk cache a couple of times so far.

Questions:
1) As mentioned earlier, we are on CentOS 6.5 with kernel version 2.6.32. Are there any known issues with these versions? We read in some forums that downgrading to CentOS 5.8 and kernel 2.6.18 has helped with similar issues. Is it worth trying?
2) Should we look into changing our I/O scheduler from CFQ to deadline? (http://www.cybertec.at/postgresql-linux-kernel-io-tuning/)

Thank You
Ramya
