Re: BUG #14721: Assertion of synchronous replication

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: const_sunny(at)126(dot)com
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14721: Assertion of synchronous replication
Date: 2017-06-29 04:27:00
Message-ID: CAEepm=3k=gTAn5=X_Qv=hWw9JnUxUMXCzBxTKPaHHXxKkF0+iw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Jun 29, 2017 at 2:36 PM, <const_sunny(at)126(dot)com> wrote:
> The following bug has been logged on the website:
>
> Bug reference: 14721
> Logged by: Const Zhang
> Email address: const_sunny(at)126(dot)com
> PostgreSQL version: 9.6.2
> Operating system: CentOS7

Hi Const,

Thanks for the detailed report. What type of CPU is this running on?

> I have found a bug about synchronous replication.
> At first, see the stack of the core file.
>
> 1
> (gdb) bt
> 2
> #0 0x00007fe9aab2e1d7 in raise () from /lib64/libc.so.6
> 3
> #1 0x00007fe9aab2f8c8 in abort () from /lib64/libc.so.6
> 4
> #2 0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111
> "!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))", errorType=0xb6c443
> "FailedAssertion",
> 5
> fileName=0xcdc140
> "/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c",
> lineNumber=294) at

I assume this is line 298 in master today. There is some interesting
IPC going on here. SyncRepWakeQueue() does three things while holding
SyncRepLock in some other process:

thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
SHMQueueDelete(&(thisproc->syncRepLinks));
SetLatch(&(thisproc->procLatch));

Meanwhile SyncRepWaitForLSN() in your process does this:

/*
* Acquiring the lock is not needed, the latch ensures proper
* barriers. If it looks like we're done, we must really be done,
* because once walsender changes the state to SYNC_REP_WAIT_COMPLETE,
* it will never update it again, so we can't be seeing a stale value
* in that case.
*/
if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
break;

... OK, then outside the loop:

/*
* WalSender has checked our LSN and has removed us from queue. Clean up
* state and leave. It's OK to reset these shared memory fields without
* holding SyncRepLock, because any walsenders will ignore us anyway when
* we're not on the queue.
*/
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));

I wonder your CPU core was able to see syncRepState ==
SYNC_REP_WAIT_COMPLETE, but not yet see MyProc->syncRepLinks's next
and previous members as 0. When you look at them in your debugger you
see them as zero, but that's a bit later:

> (gdb) p MyProc->syncRepLinks
> 2
> $1 = {prev = 0x0, next = 0x0}

I may be way off base here, and haven't studied all of your report
yet. But my first thought is: shouldn't SyncRepWakeQueue() do things
the other way around, with a barrier in between, like this:

SHMQueueDelete(&(thisproc->syncRepLinks));
pg_write_barrier();
thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
SetLatch(&(thisproc->procLatch));

... and then shouldn't SyncRepWaitForLSN() have a pg_read_barrier()
inserted before Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)))?

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message const_sunny@126.com 2017-06-29 04:30:09 Assertion of synchronous replication
Previous Message const_sunny 2017-06-29 02:36:23 BUG #14721: Assertion of synchronous replication