Re: [Bug Fix]standby may crash when switching-over in certain special cases

From: px shi <spxlyy123(at)gmail(dot)com>
To: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [Bug Fix]standby may crash when switching-over in certain special cases
Date: 2024-09-30 07:14:54
Message-ID: CAAccyYKXRVSmfC-YYdPbgsZfPiK_Tk4RLggxWs8UETxfKD7kRA@mail.gmail.com
Lists: pgsql-hackers

Thanks for responding.

> It is odd that the standby server crashes when
> replication fails because the standby would keep retrying to get the
> next record even in such a case.

As I mentioned earlier, when replication fails, the standby tries again to
establish streaming replication. At that point, *walrcv->flushedUpto* does
not necessarily reflect the WAL actually flushed to disk. However, the
startup process assumes that *walrcv->flushedUpto* is the latest flushed LSN
and attempts to open the corresponding WAL file. That file does not exist,
so the open fails and the startup process PANICs.
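The failure described above can be modeled in a few lines of Python. This is a simplified simulation, not PostgreSQL code. `SimWalRcv`, `restart_streaming`, and `startup_read` are hypothetical names, segments are assumed to be the default 16 MB, and s2 is assumed to be reading timeline 2 after s1's promotion. Only the WAL file-name layout (timeline, then the segment number split into two 8-hex-digit parts) follows PostgreSQL's real convention.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the default 16 MB WAL segment size.
WAL_SEG_SIZE = 16 * 1024 * 1024


def wal_file_name(tli: int, lsn: int) -> str:
    """Build a WAL segment file name in PostgreSQL's layout:
    timeline, then the segment number split into high and low parts."""
    segno = lsn // WAL_SEG_SIZE
    segs_per_id = 0x100000000 // WAL_SEG_SIZE
    return f"{tli:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


@dataclass
class SimWalRcv:
    """Hypothetical stand-in for the shared walreceiver state."""
    flushed_upto: int   # position the startup process trusts as flushed
    receive_start: int  # position where streaming (re)started


def restart_streaming(w: SimWalRcv, start: int, reset_flushed: bool) -> None:
    """Streaming is re-established at `start` after a failure."""
    w.receive_start = start
    if reset_flushed:
        # Hypothetical fix along the lines discussed in the thread:
        # drop the stale position instead of carrying it over.
        w.flushed_upto = start


def startup_read(w: SimWalRcv, tli: int, read_ptr: int,
                 files_on_disk: set) -> Optional[str]:
    """Startup process: return the file it opens, or None if it must wait."""
    if read_ptr >= w.flushed_upto:
        return None  # nothing flushed past read_ptr yet: wait for walreceiver
    name = wal_file_name(tli, read_ptr)
    if name not in files_on_disk:
        # In the reported bug, this failed open ends in a PANIC.
        raise FileNotFoundError(name)
    return name


start = 3 * WAL_SEG_SIZE                        # streaming restarts in segment 3
on_disk = {wal_file_name(2, 2 * WAL_SEG_SIZE)}  # only segment 2 exists locally

stale = SimWalRcv(flushed_upto=5 * WAL_SEG_SIZE, receive_start=0)
restart_streaming(stale, start, reset_flushed=False)
try:
    startup_read(stale, 2, start, on_disk)
except FileNotFoundError as e:
    print("would PANIC, missing", e)

fixed = SimWalRcv(flushed_upto=5 * WAL_SEG_SIZE, receive_start=0)
restart_streaming(fixed, start, reset_flushed=True)
print(startup_read(fixed, 2, start, on_disk))  # None: the startup process waits
```

In this model, resetting the trusted position when streaming restarts makes the startup process wait for new WAL. Without the reset, it opens a segment that was never written locally.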

Regards,
Pixian Shi

Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp> wrote on Mon, Sep 30, 2024 at 13:47:

> On Wed, 21 Aug 2024 09:11:03 +0800
> px shi <spxlyy123(at)gmail(dot)com> wrote:
>
> > Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote on Wed, Aug 21, 2024 at 00:49:
> >
> > >
> > >
> > > > Is s1 a cascading standby of s2? If instead s1 and s2 are each
> > > > standbys of the primary server, it is not surprising that s2 has
> > > > progressed further than s1 when the primary fails. I believe this is
> > > > the case where you should use pg_rewind. Even if flushedUpto is reset
> > > > as proposed in your patch, s2 might already have applied a WAL record
> > > > that s1 has not processed yet, and there would be no guarantee that
> > > > subsequent applies succeed.
> > >
> > >
> > Thank you for your response. In my scenario, s1 and s2 are both standbys
> > of the primary server: s1 is a synchronous standby and s2 is an
> > asynchronous standby. You mentioned that if s2's replay progress is ahead
> > of s1, pg_rewind should be used. However, what I'm trying to address is
> > an issue where s2 crashes during replay after s1 has been promoted to
> > primary, even though s2's progress hasn't surpassed s1's.
>
> I understood your point. It is odd that the standby server crashes when
> replication fails because the standby would keep retrying to get the
> next record even in such a case.
>
> Regards,
> Yugo Nagata
>
> >
> > Regards,
> > Pixian Shi
>
>
> --
> Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
>
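The two cases in this exchange differ in whether s2 got past the point where s1's new timeline branched off. Below is a minimal sketch of that comparison. It assumes LSNs in PostgreSQL's textual `X/Y` form, for example from `pg_last_wal_replay_lsn()` on s2, with the switch point taken from the promoted server's timeline history file. The helper names are hypothetical, and pg_rewind's real decision involves more than this single comparison.

```python
def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's textual LSN form 'X/Y' to an integer:
    the part before the slash is the high 32 bits, the part after the low 32."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def standby_is_ahead(standby_lsn: str, switchpoint_lsn: str) -> bool:
    """Rough check (a simplification): has the standby gone past the point
    where the promoted server's new timeline branched off? If so, it may hold
    WAL the new primary never had, which is the case for pg_rewind."""
    return parse_lsn(standby_lsn) > parse_lsn(switchpoint_lsn)


# Hypothetical values: s2's last replayed LSN versus the switch point
# recorded when s1 was promoted.
print(standby_is_ahead("0/3000060", "0/3000148"))  # False: s2 is behind s1
print(standby_is_ahead("0/30001A0", "0/3000148"))  # True: rewind territory
```

The first call matches the scenario described in the thread: s2 has not passed s1, so the crash cannot be explained by divergent WAL.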
