Re: Out of Memory errors are frustrating as heck!

From: Gunther <raj(at)gusw(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: Out of Memory errors are frustrating as heck!
Date: 2019-04-15 16:34:38
Message-ID: 2682858c-7916-11fe-d7e8-7c649afbbc74@gusw.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

OK Guys, you are very kind to continue taking an interest in this matter.

I will try what I can to help squish the bug.

Tomas Vondra just added a good idea that explains why I get the out of
memory with still having so much cache available:

# sysctl vm.overcommit_memory
vm.overcommit_memory = 2
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50

as he predicted.

# cat /proc/meminfo |grep Commit
CommitLimit:     3955192 kB
Committed_AS:    2937352 kB

So I thing that explains why it turns into an out of memory error. We
don't worry or wonder about that any more. I will change that parameter
in the future to allow for some spikes. But it's not going to resolve
the underlying memory leak issue.

Now I run explain analyze SELECT ... without the INSERT.

integrator=# \set VERBOSITY verbose
integrator=#
integrator=# \pset pager off
Pager usage is off.
integrator=# \pset format unaligned
Output format is unaligned.
integrator=# \set VERBOSITY verbose
integrator=#
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# explain analyze SELECT * FROM reports.v_BusinessOperation;
ERROR: 53200: out of memory
DETAIL: Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION: MemoryContextAlloc, mcxt.c:798

And since that failed already, I guess we don't need to worry about the
temporary table insert.

About adding LIMIT, I don't think it makes sense in the outer query,
since the error is probably happening earlier. I did put a LIMIT 100 on
one of the tables we join to, and it helped. But that doesn't really
tell us anything I think.

Then yes, I can try the backtrace with the NLs enabled. It will just
take a long long time and unfortunately it is extremely likely that I
lose the console and then will be unable to get back to it. OK,
screen(1) resolves that problem too. Will do, after I reported the above.

But now you have already produced more ideas ...

>>   Maybe it is memory for trigger or constraint checking, although I
>> don't
>>   know why that would appear instantly.  What triggers or constraints
>> do you
>>   have on businessoperation?
>
> Yeah, that would be my guess too. If I had to guess, something likely
> gets
> confused and allocates memory in es_query_ctx instead of the per-tuple
> context (es_per_tuple_exprcontext).
>
> Triggers, constraints and expr evaluation all seem like a plausible
> candidates. It's going to be hard to nail the exact place, though

I think triggers and constraints is ruled out, because the problem
happens without the INSERT.

That leaves us with expression evaluation. And OK, now you really wanna
see the query, although it should be in the plan too. But for what it is
worth:

SELECT DISTINCT
documentInternalId, is_current,
documentId,
documentTypeCode,
subjectRoleInternalId,
subjectEntityInternalId,
subjectEntityId,
subjectEntityIdRoot,
subjectEntityName,
subjectEntityTel,
subjectEntityEmail,
otherEntityInternalId,
confidentialityCode,
actInternalId,
code_code as operationCode,
code_displayName AS operationName,
operationQualifierCode,
operationQualifierName,
approvalNumber,
approvalNumberSystem,
approvalStateCode,
approvalStateCodeSystem,
approvalEffectiveTimeLow,
approvalEffectiveTimeHigh,
approvalStatusCode,
licenseCode,
agencyId,
agencyName,
productItemCode,
productInternalId
FROM reports.DocumentInformationSubject
LEFT OUTER JOIN (SELECT documentInternalId, documentId, actInternalId,
subjectEntityCode as productItemCode,
subjectEntityInternalId as productInternalId
FROM reports.DocumentInformationSubject
WHERE participationTypeCode = 'PRD') prd
USING(documentInternalId, documentId, actInternalId)
LEFT OUTER JOIN (
SELECT documentInternalId,
q.code_code AS operationQualifierCode,
q.code_displayName AS operationQualifierName,
r.targetInternalId AS actInternalId,
actInternalId AS approvalInternalId,
an.extension AS approvalNumber,
an.root AS approvalNumberSystem,
qs.subjectEntityCode AS approvalStateCode,
qs.subjectEntityCodeSystem AS approvalStateCodeSystem,
qs.effectivetime_low AS approvalEffectiveTimeLow,
qs.effectivetime_high AS approvalEffectiveTimeHigh,
qs.statusCode AS approvalStatusCode,
qs.code_code AS licenseCode,
agencyId.extension AS agencyId,
agencyName.trivialName AS agencyName
FROM reports.DocumentInformation q
LEFT OUTER JOIN (SELECT * FROM reports.DocumentInformationSubject WHERE participationTypeCode = 'AUT') qs
USING(documentInternalId, actInternalId)
INNER JOIN integrator.ActRelationship r
ON( r.sourceInternalId = actInternalId
AND r.typeCode = 'SUBJ')
LEFT OUTER JOIN integrator.Act_id an USING(actInternalId)
LEFT OUTER JOIN integrator.Entity_id agencyId ON(agencyId.entityInternalId = otherEntityInternalId)
LEFT OUTER JOIN reports.BestName agencyName ON(agencyName.entityInternalId = otherEntityInternalId)
WHERE q.classCode = 'CNTRCT'
AND q.moodCode = 'EVN'
AND q.code_codeSystem = '2.16.840.1.113883.3.26.1.1'
) q
USING(documentInternalId, actInternalId)
WHERE classCode = 'ACT'
AND moodCode = 'DEF'
AND code_codeSystem = '2.16.840.1.113883.3.26.1.1'
AND participationTypeCode IN ('PPRF','PRF');

You see that the expressions are all just equal operations, some IN,
nothing outlandish.

Now I will try what Tom Lane suggested. Here you go. And I have it
stopped at this state, so if you want me to inspect anything else, I can
do it.

With screen(1) I can be sure I won't lose my stuff when my internet goes
down. Nice.

I have one screen session with 3 windows:

1. psql
2. gdb
3. misc (vmstat, etc.)

Now I have let this run for a good long time while setting up my screen
stuff. And then:

ps -x

look for the postgres job with the EXPLAIN ... that's $PID, then:

gdb -p $PID

Then first I do

cont

but then it stops at SIGUSR1, because of the parallel workers signalling
each other.

handle SIGUSR1 nostop

suppresses that stopping. Then I break CTRL-C, and set the breakpoint
where Tom Lane said:

b AllocSetAlloc

once it stops there I do

Breakpoint 1, AllocSetAlloc (context=0x1168230, size=8) at aset.c:715
715 {
(gdb) p context->name
$4 = 0x96ce5b "ExecutorState"

So I should even be able to set a conditional breakpoint.

(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) b AllocSetAlloc if strcmp(context->name, "ExecutorState") == 0
Breakpoint 2 at 0x848ed0: file aset.c, line 715.
(gdb) cont
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=10) at aset.c:715
715 {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=152) at aset.c:715
715 {
(gdb) cont
Continuing.

Program received signal SIGUSR1, User defined signal 1.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=201) at aset.c:715
715 {
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
715 {
(gdb) p context->name
$8 = 0x96ce5b "ExecutorState"

Nice. Now the question is, am I at the place where memory gets squeezed?
And I think yes. With top

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31752 postgres 20 0 2772964 1.2g 329640 t 0.0 16.5 8:46.59 postgres: postgres integrator [local] EXPLAIN

I guess I should run this for a little longer. So I disable my breakpoints

(gdb) info breakpoints
Num Type Disp Enb Address What
2 breakpoint keep y 0x0000000000848ed0 in AllocSetAlloc at aset.c:715
stop only if strcmp(context->name, "ExecutorState") == 0
breakpoint already hit 6 times
(gdb) disable 2
(gdb) cont
Continuing.

while watching top:

31752 postgres 20 0 2777060 1.3g 329920 D 33.2 17.9 8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres 20 0 2777060 1.4g 329920 D 33.2 17.9 8:52.07 postgres: postgres integrator [local] EXPLAIN

31752 postgres 20 0 2777060 1.5g 329920 D 33.2 17.9 8:52.07 postgres: postgres integrator [local] EXPLAIN

it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped
growing fast, so now back to gdb and break:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
715 {

Now I give you a bt so we have something to look at:

#0 AllocSetAlloc (context=0x1168230, size=385) at aset.c:715
#1 0x000000000084e6cd in palloc (size=385) at mcxt.c:938
#2 0x000000000061019c in ExecHashJoinGetSavedTuple (file=file(at)entry=0x8bbc528, hashvalue=hashvalue(at)entry=0x7fff2e4ca76c,
tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3 0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4 ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5 ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6 0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7 0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8 ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9 0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)
at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt(at)entry=0x1199a68, into=into(at)entry=0x0, es=es(at)entry=0x1141d48,
queryString=<optimized out>, params=0x0, queryEnv=queryEnv(at)entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,
queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)
at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate(at)entry=0xf74608, stmt=stmt(at)entry=0x11ef240,
queryString=queryString(at)entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
params=params(at)entry=0x0, queryEnv=queryEnv(at)entry=0x0, dest=dest(at)entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,
queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,
setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal(at)entry=0xfb06b0, isTopLevel=isTopLevel(at)entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal(at)entry=0xfb06b0, count=count(at)entry=9223372036854775807,
isTopLevel=isTopLevel(at)entry=true, run_once=run_once(at)entry=true, dest=dest(at)entry=0xf4c570, altdest=altdest(at)entry=0xf4c570,
completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (
query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv(at)entry=0xf76ce8, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc(at)entry=3, argv=argv(at)entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

But who knows if that's it. I continue and watch top again...

31752 postgres 20 0 3112352 1.8g 329920 D 32.2 23.7 9:43.75 postgres: postgres integrator [local] EXPLAIN

it went quickly to 1.6, then after some time to 1.7, then 1.8, and I
stop again:

^C
Program received signal SIGINT, Interrupt.
0x00007f048f336d71 in read () from /lib64/libpthread.so.0
(gdb) enable 2
(gdb) cont
Continuing.

Breakpoint 2, AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
715 {
bt
#0 AllocSetAlloc (context=0x1168230, size=375) at aset.c:715
#1 0x000000000084e6cd in palloc (size=375) at mcxt.c:938
#2 0x000000000061019c in ExecHashJoinGetSavedTuple (file=file(at)entry=0x21df688, hashvalue=hashvalue(at)entry=0x7fff2e4ca76c,
tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277
#3 0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042
#4 ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539
#5 ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565
#6 0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461
#7 0x000000000061ce4e in ExecProcNode (node=0x11688e0) at ../../../src/include/executor/executor.h:247
#8 ExecSort (pstate=0x11687d0) at nodeSort.c:107
#9 0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461
#10 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x00000000005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x00000000005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x11685e0, estate=0x1168340)
at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#16 0x000000000059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt(at)entry=0x1199a68, into=into(at)entry=0x0, es=es(at)entry=0x1141d48,
queryString=<optimized out>, params=0x0, queryEnv=queryEnv(at)entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x000000000059c9ef in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x1141d48,
queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0)
at explain.c:371
#18 0x000000000059ce37 in ExplainQuery (pstate=pstate(at)entry=0xf74608, stmt=stmt(at)entry=0x11ef240,
queryString=queryString(at)entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
params=params(at)entry=0x0, queryEnv=queryEnv(at)entry=0x0, dest=dest(at)entry=0xf74578) at explain.c:254
#19 0x000000000072ca5d in standard_ProcessUtility (pstmt=0x11ef390,
queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x000000000072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=<optimized out>,
setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x000000000072add2 in FillPortalStore (portal=portal(at)entry=0xfb06b0, isTopLevel=isTopLevel(at)entry=true) at pquery.c:1038
#22 0x000000000072b855 in PortalRun (portal=portal(at)entry=0xfb06b0, count=count(at)entry=9223372036854775807,
isTopLevel=isTopLevel(at)entry=true, run_once=run_once(at)entry=true, dest=dest(at)entry=0xf4c570, altdest=altdest(at)entry=0xf4c570,
completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x00000000007276e8 in exec_simple_query (
query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x0000000000729534 in PostgresMain (argc=<optimized out>, argv=argv(at)entry=0xf76ce8, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4182
#25 0x00000000006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x00000000006bf122 in PostmasterMain (argc=argc(at)entry=3, argv=argv(at)entry=0xf45320) at postmaster.c:1379
#29 0x00000000004822dc in main (argc=3, argv=0xf45320) at main.c:228

Good, now I leave this all sitting like that for you to ask me what else
you might want to see.

We are now close to the edge of the cliff.

-Gunther

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2019-04-15 16:51:56 Re: Out of Memory errors are frustrating as heck!
Previous Message Tomas Vondra 2019-04-15 16:19:06 Re: Block / Page Size Optimization