Re: Deadlock in XLogInsert at AIX

From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Deadlock in XLogInsert at AIX
Date: 2017-01-30 14:26:20
Message-ID: 1485786380.3084.2.camel@oopsware.de
Lists: pgsql-hackers

Hi Konstantin,

We observed exactly the same issue on a customer system with the
same environment and PostgreSQL 9.5.5. Additionally, we tested on
Linux with XL C 12 and 13 and saw exactly the same deadlock behavior.

So we assumed that this is somehow a compiler issue.

On Tuesday, 24.01.2017, at 19:26 +0300, Konstantin Knizhnik wrote:
> More information about the problem - Postgres log contains several
> records:
>
> 2017-01-24 19:15:20.272 MSK [19270462] LOG:  request to flush past end of generated WAL; request 6/AAEBE000, currpos 6/AAEBC2B0
>
> and they correspond to the time when the deadlock happens.

Yeah, the same logs here:

LOG:  request to flush past end of generated WAL; request 1/1F4C6000, currpos 1/1F4C40E0
STATEMENT:  UPDATE pgbench_accounts SET abalance = abalance + -2653 WHERE aid = 3662494;

> There is the following comment in xlog.c concerning this message:
>
>      /*
>       * No-one should request to flush a piece of WAL that hasn't even been
>       * reserved yet. However, it can happen if there is a block with a bogus
>       * LSN on disk, for example. XLogFlush checks for that situation and
>       * complains, but only after the flush. Here we just assume that to mean
>       * that all WAL that has been reserved needs to be finished. In this
>       * corner-case, the return value can be smaller than 'upto' argument.
>       */
>
> So it looks like this should not happen.
> The first thing to suspect is the spinlock implementation, which is
> different for GCC and XLC.
> But ... if I rebuild Postgres without spinlocks, the problem still
> reproduces.
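
For readers following along, the behavior described in the quoted xlog.c comment can be sketched roughly as below. This is only an illustration and not the actual PostgreSQL code: `clamp_flush_request` and its parameter names are hypothetical, and the only thing taken from the source is the log message format and the rule that the request is capped at what has been reserved.

```c
#include <stdint.h>
#include <stdio.h>

/* PostgreSQL represents an LSN as a 64-bit integer; printed as high/low 32 bits. */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper (not a PostgreSQL function): if a caller asks to
 * flush past the end of the reserved WAL, log it and fall back to the
 * reserved position, so the result can be smaller than 'upto'.
 */
static XLogRecPtr
clamp_flush_request(XLogRecPtr upto, XLogRecPtr reserved_upto)
{
    if (upto > reserved_upto)
    {
        fprintf(stderr,
                "LOG:  request to flush past end of generated WAL; "
                "request %X/%X, currpos %X/%X\n",
                (unsigned int) (upto >> 32), (unsigned int) upto,
                (unsigned int) (reserved_upto >> 32),
                (unsigned int) reserved_upto);
        upto = reserved_upto;
    }
    return upto;
}
```

With the values from the log above, `clamp_flush_request(0x11F4C6000, 0x11F4C40E0)` would emit the message and return the `currpos` value, which is why the message itself points at a bogus request rather than at the wait logic.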

Before we got the results from XLC on Linux (where Postgres shows the
same behavior), I had a look at the spinlock implementation. If I got
it right, XLC doesn't use the ppc64-specific ones, but the fallback
implementation (system monitoring on AIX also showed massive numbers
of calls to signal(0)...). So I tried the following patch:

diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
new file mode 100644
index f901a0c..028cced
*** a/src/include/port/atomics/arch-ppc.h
--- b/src/include/port/atomics/arch-ppc.h
***************
*** 23,26 ****
--- 23,33 ----
  #define pg_memory_barrier_impl()      __asm__ __volatile__ ("sync" : : : "memory")
  #define pg_read_barrier_impl()                __asm__ __volatile__ ("lwsync" : : : "memory")
  #define pg_write_barrier_impl()               __asm__ __volatile__ ("lwsync" : : : "memory")
+
+ #elif defined(__IBMC__) || defined(__IBMCPP__)
+
+ #define pg_memory_barrier_impl()      __asm__ __volatile__ (" sync \n" ::: "memory")
+ #define pg_read_barrier_impl()                __asm__ __volatile__ (" lwsync \n" ::: "memory")
+ #define pg_write_barrier_impl()               __asm__ __volatile__ (" lwsync \n" ::: "memory")
+
+
  #endif

This didn't change the picture, though.
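
In case someone wants to double-check which branch of such an `#if`/`#elif` chain a given compiler actually takes, a small probe along these lines might help. It is only a sketch: the function names `compiler_family` and `targets_powerpc` are made up for this mail, and it merely tests the same predefined macros the patch relies on (`__GNUC__`, `__IBMC__`, `__IBMCPP__`) plus the usual PowerPC target macros. It says nothing about what the barriers do at runtime.

```c
#include <stdio.h>

/* Hypothetical probe: report which compiler-identification macro is set. */
static const char *
compiler_family(void)
{
#if defined(__IBMC__) || defined(__IBMCPP__)
    return "IBM XL C/C++ (__IBMC__ or __IBMCPP__ defined)";
#elif defined(__GNUC__)
    return "GCC-compatible (__GNUC__ defined)";
#else
    return "other (none of the tested macros defined)";
#endif
}

/* Hypothetical probe: 1 if a common PowerPC target macro is defined, else 0. */
static int
targets_powerpc(void)
{
#if defined(__ppc__) || defined(__powerpc__) || \
    defined(__ppc64__) || defined(__powerpc64__) || defined(_ARCH_PPC)
    return 1;
#else
    return 0;
#endif
}
```

Calling both from a trivial `main` and printing the results with `printf("%s, powerpc=%d\n", compiler_family(), targets_powerpc())` on each build host would show whether the new `#elif` branch can be reached at all with the compiler in use.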
