Re: Race conditions with checkpointer and shutdown

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Race conditions with checkpointer and shutdown
Date: 2019-04-19 02:30:22
Message-ID: 20190419023022.GG2660@paquier.xyz
Lists: pgsql-hackers

On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> It's the latter. I searched the buildfarm database for failure logs
> including the string "server does not shut down" within the last three
> years, and got all of the hits attached. Not all of these look like
> the failure pattern Michael pointed to, but enough of them do to say
> that the problem has existed since at least mid-2017. To be concrete,
> we have quite a sample of cases where a standby server has received a
> "fast shutdown" signal and acknowledged that in its log, but it never
> gets to the expected "shutting down" message, meaning it never starts
> the shutdown checkpoint let alone finishes it. The oldest case that
> clearly looks like that is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

Interesting. I first suspected c6c3334, but this run failed on
9fcf670, which does not include that commit.

> This leads me to suspect that the problem is (a) some very low-level issue
> in spinlocks or latches or the like, or (b) a timing problem that just
> doesn't show up on generic Intel-oid platforms. The timing theory is
> maybe a bit stronger given that one test case shows this more often than
> others. I've not got any clear ideas beyond that.
>
> Anyway, this is *not* new in v12.

Indeed. It seems to me that v12 makes the problem easier to trigger,
though, and I wonder whether c6c9474 contributes to that, as more
cases have been popping up since mid-March.
--
Michael
