From: | vignesh C <vignesh21(at)gmail(dot)com> |
---|---|
To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
Cc: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Forbid the use of invalidated physical slots in streaming replication. |
Date: | 2024-02-01 16:18:38 |
Message-ID: | CALDaNm25SB9dCpu3HONRRXB6hM-DwudV7iKcVV-aTHqMWVvFaA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, 8 Jan 2024 at 10:25, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Fri, 8 Dec 2023 at 19:15, Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> >
> > > > > pg_replication_slot could be set back to null.
> > > >
> > > > In this case, since the basebackup was taken after the slot was invalidated, it
> > > > does not require the WAL that was removed. But it seems that once the
> > > > streaming starts, the slot sprints to life again and gets validated again. Here's
> > > > pg_replication_slot output after the standby starts.
> > >
> > > Actually, It doesn't bring the invalidated slot back to life completely.
> > > The slot's view data looks valid while the 'invalidated' flag of this slot is still
> > > RS_INVAL_WAL_REMOVED (user are not aware of it.)
> > >
> >
> > I was mislead by the code in pg_get_replication_slots(). I did not
> > read it till the following
> >
> > --- code ----
> > case WALAVAIL_REMOVED:
> >
> > /*
> > * If we read the restart_lsn long enough ago, maybe that file
> > * has been removed by now. However, the walsender could have
> > * moved forward enough that it jumped to another file after
> > * we looked. If checkpointer signalled the process to
> > * termination, then it's definitely lost; but if a process is
> > * still alive, then "unreserved" seems more appropriate.
> > *
> > * If we do change it, save the state for safe_wal_size below.
> > */
> > --- code ---
> >
> > I see now how an invalid slot's wal status can be reported as
> > unreserved. So I think it's a bug.
> >
> > >
> > > >
> > > > >
> > > > > So, I think it's a bug and one idea to fix is to check the validity of
> > > > > the physical slot in
> > > > > StartReplication() after acquiring the slot like what the attachment
> > > > > does, what do you think ?
> > > >
> > > > I am not sure whether that's necessarily a bug. Of course, we don't expect
> > > > invalid slots to be used but in this case I don't see any harm.
> > > > The invalid slot has been revived and has all the properties set just like a valid
> > > > slot. So it must be considered in
> > > > ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it myself
> > > > though. In case the WAL is really lost and is requested by the standby it will
> > > > throw an error "requested WAL segment [0-9A-F]+ has already been
> > > > removed". So no harm there as well.
> > >
> > > Since the 'invalidated' field of the slot is still (RS_INVAL_WAL_REMOVED), even
> > > if the walsender advances the restart_lsn, the slot will not be considered in the
> > > ReplicationSlotsComputeRequiredXmin(), so the WALs and Rows are not safe
> > > and that's why I think it's a bug.
> > >
> > > After looking closer, it seems this behavior started from 15f8203 which introduced the
> > > ReplicationSlotInvalidationCause 'invalidated', after that we check the invalidated enum
> > > in Xmin/Lsn computation function.
> > >
> > > If we want to go back to previous behavior, we need to revert/adjust the check
> > > for invalidated in ReplicationSlotsComputeRequiredXmin(), but since the logical
> > > decoding(walsender) disallow using invalidated slot, so I feel it's consistent
> > > to do similar check for physical one. Besides, pg_replication_slot_advance()
> > > also disallow passing invalidated slot to it as well. What do you think ?
> >
> > What happens if you run your script on a build prior to 15f8203?
> > Purely from reading the code, it looks like the physical slot would
> > sprint back to life since its restart_lsn would be updated. But I am
> > not able to see what happens to invalidated_at. It probably remains a
> > valid LSN and the slot would still not be considred in xmin
> > calculation. It will be good to be compatible to pre-15f8203
> > behaviour.
>
> I have changed the patch status to "Waiting on Author", as some of the
> queries requested by Ashutosh are not yet addressed.
The patch which you submitted has been awaiting your attention for
quite some time now. As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible. Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.
Regards,
Vignesh
From | Date | Subject | |
---|---|---|---|
Next Message | vignesh C | 2024-02-01 16:24:29 | Re: Move bki file pre-processing from initdb to bootstrap |
Previous Message | Michail Nikolaev | 2024-02-01 16:06:28 | Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements |