From: | Bernd Helmle <mailings(at)oopsware(dot)de> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | PostgreSQL crashes with SIGSEGV |
Date: | 2017-12-07 15:47:18 |
Message-ID: | 1512661638.9720.34.camel@oopsware.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
A customer recently reported a crash in a postgres backend. The backend
encountered a SIGSEGV, crashing during SELECTs from a fairly
complicated view using a grouping set directive. I've managed to
reproduce it by tracking it down to a specific SELECT, but
unfortunately couldn't yet manage to strip it down to a small,
repeatable test case which doesn't involve the whole (sensitive)
dataset. I'm reporting my findings so far, maybe it helps to track it
down already.
The crashed backend gives the following backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
1007 context = ((StandardChunkHeader *)
(gdb) bt
#0 0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
#1 0x00000000004798d8 in heap_free_minimal_tuple (mtup=0x17194e8) at
heaptuple.c:1403
#2 0x0000000000687202 in ExecClearTuple (slot=0x10a3f88) at
execTuples.c:455
#3 0x0000000000686b89 in ExecResetTupleTable (tupleTable=0x10a06a0,
shouldFree=0 '\000') at execTuples.c:169
#4 0x00000000006773ff in ExecEndPlan (planstate=0x10a0398,
estate=0x10a01f0) at execMain.c:1469
#5 0x0000000000675927 in standard_ExecutorEnd (queryDesc=0x10910c0) at
execMain.c:468
#6 0x0000000000675865 in ExecutorEnd (queryDesc=0x10910c0) at
execMain.c:439
#7 0x000000000062c71c in PortalCleanup (portal=0xf79ae0) at
portalcmds.c:280
#8 0x000000000099ce97 in PortalDrop (portal=0xf79ae0, isTopCommit=0
'\000') at portalmem.c:510
#9 0x0000000000813639 in exec_simple_query (
query_string=0xf3dbe0 "SELECT * FROM \"xxxx\".\"xxxx\" WHERE (xxxx
= 'xxxx') LIMIT 1000;") at postgres.c:1095
#10 0x00000000008177ea in PostgresMain (argc=1, argv=0xee9e10,
dbname=0xeb9a78 "xxxx", username=0xee9c78 "bernd")
at postgres.c:4072
#11 0x000000000078cc27 in BackendRun (port=0xedc700) at
postmaster.c:4342
#12 0x000000000078c377 in BackendStartup (port=0xedc700) at
postmaster.c:4016
#13 0x0000000000788983 in ServerLoop () at postmaster.c:1721
#14 0x0000000000787fb8 in PostmasterMain (argc=1, argv=0xeb7960) at
postmaster.c:1329
#15 0x00000000006d1b52 in main (argc=1, argv=0xeb7960) at main.c:228
I've tested this so far against very current REL9_6_STABLE and
REL9_5_STABLE and got them to crash with the same backtrace. The crash
is dependent on the chosen plan, experiments with work_mem show that
the crash seems to happen only if you get external sorts into the
execution plan.
REL10_STABLE seems not affected, as my extracted application query
doesn't crash there.
Running the query against REL9_6_STABLE with valgrind shows the
following results:
==00:00:01:33.067 13158== Invalid write of size 8
==00:00:01:33.067 13158== at 0x93D53B: AllocSetFree (aset.c:998)
==00:00:01:33.067 13158== by 0x93EC03: pfree (mcxt.c:1012)
==00:00:01:33.067 13158== by 0x476E34: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:01:33.067 13158== by 0x6521A3: ExecClearTuple
(execTuples.c:455)
==00:00:01:33.067 13158== by 0x651D4B: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.540 14381== Invalid read of size 8
==00:00:03:34.540 14381== at 0x99E29C: pfree (mcxt.c:1007)
==00:00:03:34.540 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.540 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.540 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.540 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.540 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.540 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.540 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.540 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.540 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.540 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.540 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:34.540 14381== Address 0x188cd220 is 474,080 bytes inside a
block of size 1,048,576 free'd
==00:00:03:34.540 14381== at 0x4C2FD18: free
(vg_replace_malloc.c:530)
==00:00:03:34.540 14381== by 0x99A8DB: AllocSetDelete (aset.c:653)
==00:00:03:34.540 14381== by 0x99C7A4: MemoryContextDelete
(mcxt.c:225)
==00:00:03:34.540 14381== by 0x99C855: MemoryContextDeleteChildren
(mcxt.c:245)
==00:00:03:34.540 14381== by 0x99C772: MemoryContextDelete
(mcxt.c:208)
==00:00:03:34.540 14381== by 0x9A5F10: tuplesort_end
(tuplesort.c:1198)
==00:00:03:34.540 14381== by 0x692568: ExecEndAgg (nodeAgg.c:3449)
==00:00:03:34.540 14381== by 0x67BB46: ExecEndNode
(execProcnode.c:755)
==00:00:03:34.540 14381== by 0x6AD652: ExecEndSubqueryScan
(nodeSubqueryscan.c:181)
==00:00:03:34.540 14381== by 0x67BA69: ExecEndNode
(execProcnode.c:697)
==00:00:03:34.540 14381== by 0x69DF6E: ExecEndLimit
(nodeLimit.c:438)
==00:00:03:34.540 14381== by 0x67BB9D: ExecEndNode
(execProcnode.c:779)
==00:00:03:34.540 14381== Block was alloc'd at
==00:00:03:34.540 14381== at 0x4C2EB6B: malloc
(vg_replace_malloc.c:299)
==00:00:03:34.540 14381== by 0x99AED4: AllocSetAlloc (aset.c:866)
==00:00:03:34.540 14381== by 0x99DC7A: palloc (mcxt.c:904)
==00:00:03:34.540 14381== by 0x4798F6: heap_copy_minimal_tuple
(heaptuple.c:1417)
==00:00:03:34.540 14381== by 0x6878D7: ExecCopySlotMinimalTuple
(execTuples.c:593)
==00:00:03:34.540 14381== by 0x9AC1D2: copytup_heap
(tuplesort.c:3998)
==00:00:03:34.540 14381== by 0x9A616C: tuplesort_puttupleslot
(tuplesort.c:1345)
==00:00:03:34.540 14381== by 0x68D7D8: fetch_input_tuple
(nodeAgg.c:601)
==00:00:03:34.540 14381== by 0x68FEC1: agg_retrieve_direct
(nodeAgg.c:2168)
==00:00:03:34.540 14381== by 0x68F9AB: ExecAgg (nodeAgg.c:1903)
==00:00:03:34.540 14381== by 0x67B70A: ExecProcNode
(execProcnode.c:503)
==00:00:03:34.540 14381== by 0x6AD40F: SubqueryNext
(nodeSubqueryscan.c:53)
==00:00:03:34.540 14381==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:pfree
fun:heap_free_minimal_tuple
fun:ExecClearTuple
fun:ExecResetTupleTable
fun:ExecEndPlan
fun:standard_ExecutorEnd
fun:ExecutorEnd
fun:PortalCleanup
fun:PortalDrop
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
}
==00:00:03:34.548 14381== Invalid read of size 4
==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.548 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.548 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:34.548 14381== Address 0x7f7f7f7f7f7f7f7f is not stack'd,
malloc'd or (recently) free'd
==00:00:03:34.548 14381==
{
<insert_a_suppression_name_here>
Memcheck:Addr4
fun:pfree
fun:heap_free_minimal_tuple
fun:ExecClearTuple
fun:ExecResetTupleTable
fun:ExecEndPlan
fun:standard_ExecutorEnd
fun:ExecutorEnd
fun:PortalCleanup
fun:PortalDrop
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
}
==00:00:03:34.548 14381==
==00:00:03:34.548 14381== Process terminating with default action of
signal 11 (SIGSEGV): dumping core
==00:00:03:34.548 14381== General Protection Fault
==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.548 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.548 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== Process terminating with default action of
signal 11 (SIGSEGV)
==00:00:03:35.088 14381== General Protection Fault
==00:00:03:35.088 14381== at 0x632016C: _dl_catch_error (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x631F8E6: __libc_dlclose (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x634B5E4: free_mem (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x634B1E1: __libc_freeres (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x4A296DB: _vgnU_freeres
(vg_preloaded.c:77)
==00:00:03:35.088 14381== by 0x18118EA7: ???
==00:00:03:35.088 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:35.088 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:35.088 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:35.088 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:35.088 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:35.088 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== HEAP SUMMARY:
==00:00:03:35.088 14381== in use at exit: 6,833,625 bytes in 531
blocks
==00:00:03:35.088 14381== total heap usage: 814,111 allocs, 1,264
frees, 118,761,978 bytes allocated
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For a detailed leak analysis, rerun with: --
leak-check=full
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For counts of detected and suppressed errors,
rerun with: -v
==00:00:03:35.088 14381== ERROR SUMMARY: 2 errors from 2 contexts
(suppressed: 35 from 6)
It seems the backend tries to free a minimal tuple at executor end,
which is already gone by deleting the memory context it was allocated
in before. ExecResetTupleTable() is looping through a list with 70
entries, it encounters (after 6 or seven rounds) the first tuple slot
with tts_shouldFreeMin set, all others before don't have it set:
(gdb) p *slot
$6 = {type = T_TupleTableSlot, tts_isempty = 0 '\000', tts_shouldFree =
0 '\000', tts_shouldFreeMin = 1 '\001', tts_slow = 1 '\001',
tts_tuple = 0x10a6c48, tts_tupleDescriptor = 0x10d1be8, tts_mcxt =
0xf59668, tts_buffer = 0, tts_nvalid = 6, tts_values = 0x10d2640,
tts_isnull = 0x10d26d8 "", tts_mintuple = 0x171c168, tts_minhdr =
{t_len = 162, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0},
ip_posid = 0}, t_tableOid = 0, t_data = 0x171c160}, tts_off = 88}
(gdb) p list_length(tupleTable)
tts_mintuple is invalid, though:
(gdb) p *slot->tts_mintuple
Cannot access memory at address 0x171c168
The following attempt to clean it in ExecClearTuple() then crashes the
backend.
Bernd
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2017-12-07 17:13:53 | Re: BUG #14952: COPY fails to fill in IDENTITY column default value |
Previous Message | Tom Lane | 2017-12-07 15:27:24 | Re: BUG #14953: pg_blocking_pids / array_length inconsistency |
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2017-12-07 16:02:32 | Re: Speeding up pg_upgrade |
Previous Message | Stephen Frost | 2017-12-07 15:37:30 | Re: Speeding up pg_upgrade |