From: | Dmitry Marakasov <amdmi3(at)amdmi3(dot)ru> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #16696: Backend crash in llvmjit |
Date: | 2020-11-04 23:50:54 |
Message-ID: | 20201104235054.GB30304@hades.panopticon |
Lists: | pgsql-bugs |
* Dmitry Marakasov (amdmi3(at)amdmi3(dot)ru) wrote:
> > > > Environment details:
> > > > - FreeBSD 12.1 amd64
> > > > - PostgreSQL 13.0 (built from FreeBSD ports)
> > > > - llvm-10.0.1 (built from FreeBSD ports)
> > >
> > > My bad, it's actually llvm-9.0.1. Multiple llvm versions are installed on
> > > the system, and PostgreSQL uses llvm9:
> > >
> > > ldd /usr/local/lib/postgresql/llvmjit.so | grep LLVM
> > > libLLVM-9.so => /usr/local/llvm90/lib/libLLVM-9.so (0x800e00000)
> >
> > Could you try generating a backtrace after turning jit_debugging_support on? That might give a bit more information.
> >
> > I'll check once I'm home whether I can reproduce in my environment.
>
> I did some digging. First of all, I've discovered that the problem
> goes away if llvm bitcode optimization is disabled (by commenting out
> llvm_optimize_module call).
>
> I've dumped the bitcode and compiled it back to assembly to compare it
> with the gdb disassembly of the failing function. It didn't match
> perfectly, but this place looked similar:
>
> # %bb.84: # %op.32.inputcall
> movq %rax, 5267(%r13)
> movb %bl, 5275(%r13)
> movb $0, 5263(%r13)
> movzbl (%rax), %esi
> movl __mb_sb_limit(%rip), %edi
> movq _ThreadRuneLocale@GOTTPOFF(%rip), %rcx
> movq %fs:0, %rdx
> movq (%rdx,%rcx), %rcx
> cmpl %esi, %edi
> movq %rax, -96(%rbp) # 8-byte Spill
> movl %edi, -72(%rbp) # 4-byte Spill
> movq %rcx, -64(%rbp) # 8-byte Spill
> jle .LBB1_85
>
> Here's my hypothesis:
>
> The problem happens when the boolin() function is inlined by LLVM.
> That function calls isspace() internally, which on FreeBSD is
> locale-specific and involves caching some locale parameters in
> a thread-local variable declared as
>
> extern _Thread_local const _RuneLocale *_ThreadRuneLocale;
>
> The execution crashes when trying to access that thread-local variable,
> probably because something related to TLS is not set up properly in/for
> LLVM.
>
> I've confirmed this hypothesis by disabling the isspace() calls in
> boolin(), which also made the crash go away.
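To illustrate the access pattern the hypothesis is about, here is a minimal sketch in C. It is not FreeBSD's actual ctype code: demo_rune_locale, demo_default_locale, demo_thread_locale and demo_isspace are hypothetical stand-ins for _RuneLocale, the global locale, _ThreadRuneLocale and isspace(). The only point is that every call reads a _Thread_local pointer, so any code that inlines such a function must also resolve the TLS reference correctly.

```c
#include <stddef.h>

/* Hypothetical stand-in for FreeBSD's _RuneLocale (not the real layout). */
typedef struct {
    unsigned long runetype[256];   /* non-zero means "is whitespace" here */
} demo_rune_locale;

/* Hypothetical process-wide default, used when no per-thread locale is set. */
static const demo_rune_locale demo_default_locale = {
    .runetype = { [' '] = 1, ['\t'] = 1, ['\n'] = 1 }
};

/*
 * Analogue of:
 *   extern _Thread_local const _RuneLocale *_ThreadRuneLocale;
 * Each thread has its own copy; it starts out as NULL.
 */
static _Thread_local const demo_rune_locale *demo_thread_locale;

/* Simplified isspace(): consults the thread-local locale pointer first. */
int demo_isspace(int c)
{
    const demo_rune_locale *loc =
        demo_thread_locale != NULL ? demo_thread_locale : &demo_default_locale;

    if (c < 0 || c >= 256)
        return 0;
    return loc->runetype[c] != 0;
}
```

When a function like this is compiled ahead of time, the linker and the runtime loader take care of the TLS relocation. When its bitcode is inlined into JIT-generated code, that job presumably falls to the JIT, which is where I suspect things go wrong.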
Long story short, I was able to mitigate the crash with the following patch:
--- disable-inlining-tls-using-functions.patch begins here ---
commit f703544edc406293e39b7a59a245e798d18f458e
Author: Dmitry Marakasov <amdmi3(at)amdmi3(dot)ru>
Date: Thu Nov 5 02:56:00 2020 +0300

    Do not inline functions accessing TLS in LLVM JIT

diff --git src/backend/jit/llvm/llvmjit_inline.cpp src/backend/jit/llvm/llvmjit_inline.cpp
index 2617a46..a063edb 100644
--- src/backend/jit/llvm/llvmjit_inline.cpp
+++ src/backend/jit/llvm/llvmjit_inline.cpp
@@ -608,6 +608,16 @@ function_inlinable(llvm::Function &F,
if (rv->materialize())
elog(FATAL, "failed to materialize metadata");
+ /*
+ * Don't inline functions with thread-local variables until
+ * related crashes are investigated (see BUG #16696)
+ */
+ if (rv->isThreadLocal()) {
+ ilog(DEBUG1, "cannot inline %s due to thread-local variable %s",
+ F.getName().data(), rv->getName().data());
+ return false;
+ }
+
/*
* Never want to inline externally visible vars, cheap enough to
* reference.
--- disable-inlining-tls-using-functions.patch ends here ---
I don't know LLVM well enough to investigate this further, but my guess
is that something TLS-related is not initialized properly.
--
Dmitry Marakasov . 55B5 0596 FF1E 8D84 5F56 9510 D35A 80DD F9D2 F77D
amdmi3(at)amdmi3(dot)ru ..: https://github.com/AMDmi3