Out of Memory errors are frustrating as heck!

From: Gunther <raj(at)gusw(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Out of Memory errors are frustrating as heck!
Date: 2019-04-14 20:23:34
Message-ID: bc138e9f-c89e-9147-5395-61d51a757b3b@gusw.net
Lists: pgsql-performance

For weeks now I have been banging my head against an "out of memory"
situation. There is only one query I am running on an 8 GB system, and
whatever I try, I get knocked out by this out-of-memory error. It is
extremely hard to understand and fix. I guess I could add a swap file,
but then I would have to take the penalty of swapping. And how can I
actually address an out-of-memory condition if the system doesn't tell
me where it is happening?
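
For completeness: the only two places I know to even look are the
kernel log, in case the Linux OOM killer shot the process, and the
PostgreSQL server log, which on a failed allocation should contain an
"out of memory" error together with a dump of the backend's memory
contexts. A rough sketch of those checks (the log path is just a guess
for a typical install):

# 1) Did the kernel OOM killer take the backend down?
dmesg | grep -iE 'out of memory|killed process'

# 2) Or did PostgreSQL itself fail a malloc? Then the server log has
#    "ERROR:  out of memory" plus a memory-context dump
#    (TopMemoryContext ... Grand total). The log path is an assumption.
grep -n -B2 -A40 'out of memory' /var/lib/pgsql/data/log/postgresql-*.log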

You might want to see the query, but it produces a huge plan that I
can't really break down here. It shouldn't matter, though. Just so you
can get a glimpse, here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
        ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)
              Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.is_current, documentinformationsubject.documentid, documentinformationsubject.documenttypecode, documentinformationsubject.subjectroleinternalid, documentinformationsubject.subjectentityinternalid, documentinformationsubject.subjectentityid, documentinformationsubject.subjectentityidroot, documentinformationsubject.subjectentityname, documentinformationsubject.subjectentitytel, documentinformationsubject.subjectentityemail, documentinformationsubject.otherentityinternalid, documentinformationsubject.confidentialitycode, documentinformationsubject.actinternalid, documentinformationsubject.code_code, documentinformationsubject.code_displayname, q.code_code, q.code_displayname, an.extension, an.root, documentinformationsubject_2.subjectentitycode, documentinformationsubject_2.subjectentitycodesystem, documentinformationsubject_2.effectivetime_low, documentinformationsubject_2.effectivetime_high, documentinformationsubject_2.statuscode, documentinformationsubject_2.code_code, agencyid.extension, agencyname.trivialname, documentinformationsubject_1.subjectentitycode, documentinformationsubject_1.subjectentityinternalid
              ->  Nested Loop Left Join  (cost=2998335.54..5338133.63 rows=34619 width=1197)
                    Join Filter: (((documentinformationsubject.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject.actinternalid)::text = (r.targetinternalid)::text))
                    ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)
                          Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                          ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)
                                Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)
                                      Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                          ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)
                                ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)
                                      Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                      ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)
                                            Filter: ((participationtypecode)::text = 'PRD'::text)
                    ->  Materialize  (cost=0.56..2318944.31 rows=13 width=341)
                          ->  Nested Loop Left Join  (cost=0.56..2318944.24 rows=13 width=341)
                                ->  Nested Loop Left Join  (cost=0.00..2318893.27 rows=1 width=281)
                                      Join Filter: ((agencyname.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                      ->  Nested Loop Left Join  (cost=0.00..2286828.33 rows=1 width=291)
                                            Join Filter: ((agencyid.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                            ->  Nested Loop Left Join  (cost=0.00..2284826.24 rows=1 width=239)
                                                  Join Filter: (((q.documentinternalid)::text = (documentinformationsubject_2.documentinternalid)::text) AND ((q.actinternalid)::text = (documentinformationsubject_2.actinternalid)::text))
                                                  ->  Nested Loop  (cost=0.00..954957.59 rows=1 width=136)
                                                        Join Filter: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                        ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)
                                                              Filter: ((typecode)::text = 'SUBJ'::text)
                                                        ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)
                                                              Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                  ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)
                                                        Filter: ((participationtypecode)::text = 'AUT'::text)
                                            ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)
                                      ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)
                                ->  Index Scan using act_id_fkidx on act_id an  (cost=0.56..50.85 rows=13 width=134)
                                      Index Cond: ((q.actinternalid)::text = (actinternalid)::text)

I have monitored the activity with vmstat and iostat, and it looks like
the memory grab happens rapidly after a sort/merge step. In iostat I see
heavy read and write activity, which I attribute to that sort/merge
step; it is followed by a sudden spike in write activity, and then the
out-of-memory crash.
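
In case anyone wants to reproduce the monitoring, it is nothing fancy,
just something along these lines (a sketch; these are the usual procps
vmstat and sysstat iostat flags):

# memory/swap/io/cpu every 2 seconds, with a timestamp column
vmstat -t 2 | tee vmstat.log

# extended per-device I/O statistics in a second terminal, also every 2 s
iostat -x 2 | tee iostat.log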

procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----
r b swpd free buff cache si so bi bo in cs us sy id wa st UTC
0 2 0 119344 0 7616672 0 0 11681 3107 9 0 6 1 72 21 0 2019-04-14 16:19:52
0 2 0 128884 0 7607288 0 0 2712 55386 500 509 3 2 15 80 0 2019-04-14 16:19:54
0 2 0 116984 0 7619916 0 0 880 59241 548 525 2 2 9 87 0 2019-04-14 16:19:56
0 2 0 131492 0 7604816 0 0 128 56512 518 401 1 1 12 86 0 2019-04-14 16:19:58
...
0 2 0 134508 0 7601480 0 0 0 58562 428 353 0 1 4 95 0 2019-04-14 16:21:46
0 2 0 125360 0 7611320 0 0 0 59392 481 369 0 1 11 89 0 2019-04-14 16:21:48
0 2 0 122896 0 7612872 0 0 0 58564 431 342 0 1 17 82 0 2019-04-14 16:21:50
1 1 0 121456 0 7614248 0 0 54 57347 487 399 0 1 13 85 0 2019-04-14 16:21:52
0 2 0 122820 0 7613324 0 0 12 59964 460 346 0 1 20 79 0 2019-04-14 16:21:54
0 2 0 120344 0 7616528 0 0 1844 55691 645 676 5 3 6 85 0 2019-04-14 16:21:56
0 2 0 124900 0 7611404 0 0 936 58261 795 1215 2 3 13 83 0 2019-04-14 16:21:58
0 2 0 124572 0 7612192 0 0 1096 55340 518 487 1 2 0 97 0 2019-04-14 16:22:00
0 2 0 123040 0 7612740 0 0 888 57574 573 620 1 2 5 92 0 2019-04-14 16:22:02
0 2 0 125112 0 7610592 0 0 124 59164 498 480 1 1 13 85 0 2019-04-14 16:22:04
1 1 0 129440 0 7607592 0 0 568 60196 563 612 2 2 8 88 0 2019-04-14 16:22:06
0 2 0 124020 0 7612364 0 0 0 58260 629 725 0 1 8 91 0 2019-04-14 16:22:08
2 1 0 124480 0 7611848 0 0 0 58852 447 331 0 1 1 98 0 2019-04-14 16:22:10
0 3 0 137636 0 7598484 0 0 11908 44995 619 714 1 1 11 87 0 2019-04-14 16:22:12
0 2 0 123128 0 7613392 0 0 29888 28901 532 972 1 1 29 68 0 2019-04-14 16:22:14
0 2 0 126260 0 7609984 0 0 39872 18836 706 1435 1 2 28 70 0 2019-04-14 16:22:16
0 2 0 130748 0 7605536 0 0 36096 22488 658 1272 2 1 8 89 0 2019-04-14 16:22:18
...
 0 2 0 127216 0 7609192 0 0 29192 29696 472 949 1 1 23 75 0 2019-04-14 16:22:40
0 2 0 147428 0 7588556 0 0 29120 29696 523 974 1 1 19 79 0 2019-04-14 16:22:42
0 1 0 120644 0 7615388 0 0 32320 25276 566 998 1 2 49 47 0 2019-04-14 16:22:44
0 1 0 128456 0 7607904 0 0 58624 0 621 1103 3 2 49 46 0 2019-04-14 16:22:46
0 1 0 127836 0 7608260 0 0 58624 0 631 1119 3 2 50 46 0 2019-04-14 16:22:48
0 1 0 126712 0 7609616 0 0 58624 0 616 1110 2 2 50 47 0 2019-04-14 16:22:50
...
0 1 0 157408 0 7578060 0 0 58628 0 736 1206 3 3 50 44 0 2019-04-14 16:27:22
0 1 0 142420 0 7593400 0 0 58688 0 623 1099 1 4 50 45 0 2019-04-14 16:27:24
0 1 0 247016 0 7488184 0 0 58568 0 649 1113 1 4 50 45 0 2019-04-14 16:27:26
0 1 0 123232 0 7612088 0 0 58412 215 675 1141 2 3 50 46 0 2019-04-14 16:27:28
0 2 0 144920 0 7586576 0 0 48376 11046 788 1455 1 5 34 60 0 2019-04-14 16:27:30
1 1 0 125636 0 7595704 0 0 36736 21381 702 1386 1 4 21 74 0 2019-04-14 16:27:32
0 3 0 156700 0 7559328 0 0 35556 23364 709 1367 1 3 22 74 0 2019-04-14 16:27:34
0 2 0 315580 0 7382748 0 0 33608 24731 787 1407 1 5 18 76 0 2019-04-14 16:27:36
...
0 2 0 684412 0 6152040 0 0 29832 28356 528 994 1 2 32 66 0 2019-04-14 16:38:04
0 2 0 563512 0 6272264 0 0 29696 29506 546 987 1 2 32 65 0 2019-04-14 16:38:06
0 2 0 595488 0 6241068 0 0 27292 30858 549 971 1 2 26 71 0 2019-04-14 16:38:08
0 2 0 550120 0 6285352 0 0 28844 29696 567 995 1 2 29 68 0 2019-04-14 16:38:10
1 1 0 432380 0 6402964 0 0 28992 29696 557 979 1 2 37 61 0 2019-04-14 16:38:12
0 2 0 445796 0 6384412 0 0 26768 32134 628 1029 1 4 27 69 0 2019-04-14 16:38:14
0 2 0 374972 0 6453592 0 0 28172 30839 529 962 1 2 43 54 0 2019-04-14 16:38:16
0 2 0 317824 0 6507992 0 0 29172 29386 560 1001 1 3 27 68 0 2019-04-14 16:38:18
0 3 0 215092 0 6609132 0 0 33116 25210 621 1148 1 3 19 77 0 2019-04-14 16:38:20
0 2 0 194836 0 6621524 0 0 27786 30959 704 1152 0 5 18 77 0 2019-04-14 16:38:22
0 3 0 315648 0 6500196 0 0 31434 27226 581 1073 0 3 31 65 0 2019-04-14 16:38:24
0 2 0 256180 0 6554676 0 0 29828 29017 668 1174 0 4 20 76 0 2019-04-14 16:38:26  <<< CRASH
0 1 0 378220 0 6552496 0 0 4348 53686 2210 3816 1 5 46 49 0 2019-04-14 16:38:28
0 1 0 389888 0 6536296 0 0 2704 56529 2454 4178 0 5 42 52 0 2019-04-14 16:38:30
0 2 0 923572 0 5998992 0 0 1612 56863 2384 3928 0 6 16 78 0 2019-04-14 16:38:32
0 0 0 908336 0 6006696 0 0 3584 49280 8961 17334 0 19 39 42 0 2019-04-14 16:38:34
0 1 0 1306480 0 5607088 0 0 264 63632 18605 37933 3 58 35 4 0 2019-04-14 16:38:36
2 1 0 1355448 0 5558576 0 0 8 59222 14817 30296 2 46 24 27 0 2019-04-14 16:38:38
2 2 0 1358224 0 5555884 0 0 0 58544 14226 28331 2 44 3 50 0 2019-04-14 16:38:40
2 1 0 1446348 0 5468748 0 0 0 58846 14376 29185 2 44 11 42 0 2019-04-14 16:38:42
0 0 0 2639648 0 4357608 0 0 0 28486 12909 26770 2 44 49 5 0 2019-04-14 16:38:44
0 0 0 2639524 0 4357800 0 0 0 0 158 154 0 0 100 0 0 2019-04-14 16:38:46
0 0 0 2687316 0 4309976 0 0 0 0 181 188 0 2 98 0 0 2019-04-14 16:38:48
0 0 0 2706920 0 4300116 0 0 0 105 137 263 0 0 100 0 0 2019-04-14 16:38:50
0 0 0 2706672 0 4300232 0 0 0 0 142 204 0 0 100 0 0 2019-04-14 16:38:52
0 0 0 2815116 0 4191928 0 0 0 0 116 242 0 0 100 0 0 2019-04-14 16:38:54
0 0 0 2815364 0 4192008 0 0 0 0 116 239 0 0 100 0 0 2019-04-14 16:38:56
0 0 0 2815116 0 4192164 0 0 0 0 159 236 0 0 100 0 0 2019-04-14 16:38:58

The log ends after the out-of-memory crash, which occurred exactly at
the marked point, 16:38:26.355 UTC.

We can't really see anything too worrisome here. There is always lots
of memory used by the page cache, which could have been reclaimed. The
only possible explanation I can think of is that at the moment of the
crash the memory utilization suddenly skyrocketed in less than a
second, so that the 2-second vmstat interval wouldn't show it??? Nah.
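
I suppose I could try to rule that out by sampling the backend's
resident set size much more often than every 2 seconds, something like
this sketch (the pg_stat_activity filter is just an example and would
need adjusting):

# PID of the active backend running the big INSERT (adjust the filter)
PID=$(psql -XAt -c "SELECT pid FROM pg_stat_activity WHERE state = 'active' AND query LIKE 'INSERT INTO businessoperation%'")

# sample its resident set size five times per second until it exits
while kill -0 "$PID" 2>/dev/null; do
    echo "$(date -u '+%H:%M:%S') $(grep -E 'VmRSS|VmSwap' /proc/$PID/status | tr '\n' ' ')"
    sleep 0.2
done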

I have already reduced work_mem considerably, which has helped in some
other cases before. Now I am going to reduce shared_buffers as well,
but that seems counter-intuitive, because we are sitting on all that
cache memory unused!
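
To be clear about what I mean by turning those knobs, this is roughly
it, with example values rather than my actual settings:

# work_mem caps the memory each individual sort/hash node may use
# before spilling to disk; it can be lowered without a restart
psql -c "ALTER SYSTEM SET work_mem = '16MB'"
psql -c "SELECT pg_reload_conf()"

# shared_buffers only takes effect after a full restart
psql -c "ALTER SYSTEM SET shared_buffers = '1GB'"
pg_ctl restart -D "$PGDATA"    # or: sudo systemctl restart postgresql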

Might this be a bug? It feels like a bug. It feels like these
out-of-memory situations should be handled more gracefully (a
garbage-collection attempt?), and that there should be more information
so that one can actually do something about it.

Any ideas?

-Gunther
