Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION

From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, chris(dot)tessels(at)inergy(dot)nl, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13985: Segmentation fault on PREPARE TRANSACTION
Date: 2016-02-25 08:02:07
Message-ID: CACACo5RdOOZLS2fRELPO77Ob0jK+2k8mhFVJ2UBUS157FTp7rA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Feb 24, 2016 at 10:52 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2016-02-24 17:52:37 -0300, Alvaro Herrera wrote:
> > chris(dot)tessels(at)inergy(dot)nl wrote:
> >
> > > Core was generated by `postgres: mailinfo_ow mailinfo_ods
> 10.50.6.6(4188'.
> > > Program terminated with signal 11, Segmentation fault.
> > >
> > > #0 MinimumActiveBackends (min=50) at procarray.c:2472
> > > 2472 if (pgxact->xid == InvalidTransactionId)
> >
> > It's not surprising that you're not able to make this crash
> > consistently, because it looks like the problem might be in concurrent
> > modifications to the PGXACT array. This routine, MinimumActiveBackends,
> > walks the PGPROC array explicitely without locks. There are comments
> > indicating that this is safe, but evidently something has slipped in
> > there.
> >
> > Apparently this code is trying to dereference an invalid pgxact, but
> > it's not clear to me how this happens. Those structs are allocated in
> > advance, and they are referenced in the code via array indexes, so even
> > if the pgxact doesn't actually hold data about a valid transaction,
> > dereferencing the XID shouldn't cause a crash.
>
> Well, that code is pretty, uh, questionable. E.g. for
> int pgprocno =
> arrayP->pgprocnos[index];
> volatile PGPROC *proc = &allProcs[pgprocno];
> volatile PGXACT *pgxact = &allPgXact[pgprocno];
> there's no guarantee that pgprocno is actually the same index for both
> lookups and the following
> if (pgprocno == -1)
> continue; /* do not count
> deleted entries */
> check. It's perfectly reasonable for a compiler to reload pgprocno from
> memory, or just always reference it via memory.
>

Hm... this is against my understanding of what a compiler could (or
should) do. Do you have a documentation reference or other evidence?

A naive disassemble dump on-my-box(tm) suggests that pgprocno is stored on
stack (with offset -0x1c) and is referenced from there:
...
0x00000000007f8011 <+53>: mov -0x18(%rbp),%rax
0x00000000007f8015 <+57>: mov -0x20(%rbp),%edx
0x00000000007f8018 <+60>: movslq %edx,%rdx
0x00000000007f801b <+63>: add $0x8,%rdx
0x00000000007f801f <+67>: mov 0x8(%rax,%rdx,4),%eax
0x00000000007f8023 <+71>: mov %eax,-0x1c(%rbp)
0x00000000007f8026 <+74>: mov 0x67b15b(%rip),%rdx # 0xe73188
<allProcs>
0x00000000007f802d <+81>: mov -0x1c(%rbp),%eax # <== here
0x00000000007f8030 <+84>: cltq
0x00000000007f8032 <+86>: imul $0x338,%rax,%rax
0x00000000007f8039 <+93>: add %rdx,%rax
0x00000000007f803c <+96>: mov %rax,-0x10(%rbp)
0x00000000007f8040 <+100>: mov 0x67b149(%rip),%rcx # 0xe73190
<allPgXact>
0x00000000007f8047 <+107>: mov -0x1c(%rbp),%eax # <== here
0x00000000007f804a <+110>: movslq %eax,%rdx
0x00000000007f804d <+113>: mov %rdx,%rax
0x00000000007f8050 <+116>: add %rax,%rax
0x00000000007f8053 <+119>: add %rdx,%rax
0x00000000007f8056 <+122>: shl $0x2,%rax
0x00000000007f805a <+126>: add %rcx,%rax
0x00000000007f805d <+129>: mov %rax,-0x8(%rbp)
0x00000000007f8061 <+133>: cmpl $0xffffffff,-0x1c(%rbp) # <== and
here
0x00000000007f8065 <+137>: je 0x7f80a4 <MinimumActiveBackends+200>
...

I presume what happened here is that initially arrayP->pgprocnos[index]
> was -1, but by the time if (pgprocno == -1) is reached, it changed to a
> different value.
>
> It's also really crummy that we're doing the PGPROC/PGXACT lookups
> before checking whether pgprocno is -1.
>

No doubt here.

At the very least ISTM that we have to make pgprocno volatile (or use a
> memory barrier - but we don't have sufficient support for those in the
> older branches), and move the PGPROC/PGXACT lookups after the == -1
> check.
>

Use of volatile doesn't change the resulting code dramatically for me.

The above is when compiling without optimizations. With -Og I'm getting
similar code for the variant with volatile and if no volatile, pgprocno is
just loaded into a register as one would expect.

--
Alex

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message syspegasus 2016-02-25 12:10:28 BUG #13990: Postgres shutting down automatically
Previous Message Brian Ghidinelli 2016-02-25 07:21:30 Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid` [WORKAROUND]