Re: WAL replay issue from 9.6.8 to 9.6.10

From: Dave Peticolas <dave(at)krondo(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: WAL replay issue from 9.6.8 to 9.6.10
Date: 2018-08-30 03:19:07
Message-ID: CAPRbp06vV51ZgDFYpWjv0pwN9oMwRh_SR=RBVhvDeWiwViH2vA@mail.gmail.com
Lists: pgsql-general

On Wed, Aug 29, 2018 at 1:50 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Wed, Aug 29, 2018 at 09:15:29AM -0700, Dave Peticolas wrote:
> > Oh, perhaps I do, depending on what you mean by worker. There are a couple
> > of periodic processes that connect to the server to obtain metrics. Is that
> > what is triggering this issue? In my case I could probably suspend them
> > until the replay has reached the desired point.
>
> That would be it. How do you decide when those begin to run and connect
> to Postgres? Do you use pg_isready or similar in a loop for sanity
> checks?
>

I do not; they just try to connect and bail if they cannot.
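
For illustration, a minimal sketch of the kind of pg_isready loop mentioned
above, which a metrics process could run before its first connection attempt.
The helper names (pg_isready_status, wait_until_accepting), the host, the
port, and the retry settings are hypothetical. The only things taken from
pg_isready itself are the -h, -p and -t options and exit status 0, which
means the server is accepting connections.

```python
import subprocess
import time

# pg_isready exits with 0 when the server is accepting connections.
# Non-zero means rejecting (e.g. still starting up), no response,
# or no attempt made.
PG_ACCEPTING = 0


def pg_isready_status(host="localhost", port=5432, timeout_s=3):
    """Run pg_isready once and return its exit status.

    host, port and timeout_s are placeholder defaults (assumptions).
    """
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port), "-t", str(timeout_s)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def wait_until_accepting(check=pg_isready_status, attempts=60,
                         delay_s=5.0, sleep=time.sleep):
    """Poll `check` until it reports an accepting server.

    Returns True as soon as the server accepts connections, or False
    once `attempts` polls have all failed. `check` and `sleep` are
    injectable so the loop can be exercised without a real server.
    """
    for attempt in range(attempts):
        if check() == PG_ACCEPTING:
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A collector would call wait_until_accepting() first and skip its run when it
returns False, rather than connecting while the standby is still replaying
WAL toward its consistent point.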

> > I have noticed this behavior in the past but prior to 9.6.10 restarting the
> > server would fix the issue. And the replay always seemed to reach a point
> > past which the problem would not re-occur.
>
> You are piquing my interest here. Did you actually see the same
> problem? In 9.6.10, I tightened the consistent point checks and logic
> so that inconsistent page issues actually show up when they should,
> and so that they become reproducible, letting us track down any rogue
> WAL record or inconsistent behavior.
>

Yes, I've seen this problem occasionally in the past, I think only in the
9.6 series. But before 9.6.10, if I restarted the server it would start
replaying WAL again, and typically when it reached the point where it had
PANICed before, it would instead report a consistent state and allow
read-only connections. Sometimes it would then PANIC again after more WAL
was replayed. But eventually it would reach a point where it seemed to be
able to replay WAL indefinitely without the issue happening.

dave
