Re: Forbid the use of invalidated physical slots in streaming replication.

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Forbid the use of invalidated physical slots in streaming replication.
Date: 2023-12-08 13:44:38
Message-ID: CAExHW5t9w_iSU+VP-GqGLeRErRJnGzkqavxzYfCk8Z+e0pu-3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > > pg_replication_slot could be set back to null.
> >
> > In this case, since the basebackup was taken after the slot was invalidated, it
> > does not require the WAL that was removed. But it seems that once the
> > streaming starts, the slot sprints to life again and gets validated again. Here's
> > pg_replication_slot output after the standby starts.
>
> Actually, It doesn't bring the invalidated slot back to life completely.
> The slot's view data looks valid while the 'invalidated' flag of this slot is still
> RS_INVAL_WAL_REMOVED (user are not aware of it.)
>

I was mislead by the code in pg_get_replication_slots(). I did not
read it till the following

--- code ----
case WALAVAIL_REMOVED:

/*
* If we read the restart_lsn long enough ago, maybe that file
* has been removed by now. However, the walsender could have
* moved forward enough that it jumped to another file after
* we looked. If checkpointer signalled the process to
* termination, then it's definitely lost; but if a process is
* still alive, then "unreserved" seems more appropriate.
*
* If we do change it, save the state for safe_wal_size below.
*/
--- code ---

I see now how an invalid slot's wal status can be reported as
unreserved. So I think it's a bug.

>
> >
> > >
> > > So, I think it's a bug and one idea to fix is to check the validity of
> > > the physical slot in
> > > StartReplication() after acquiring the slot like what the attachment
> > > does, what do you think ?
> >
> > I am not sure whether that's necessarily a bug. Of course, we don't expect
> > invalid slots to be used but in this case I don't see any harm.
> > The invalid slot has been revived and has all the properties set just like a valid
> > slot. So it must be considered in
> > ReplicationSlotsComputeRequiredXmin() as well. I haven't verified it myself
> > though. In case the WAL is really lost and is requested by the standby it will
> > throw an error "requested WAL segment [0-9A-F]+ has already been
> > removed". So no harm there as well.
>
> Since the 'invalidated' field of the slot is still (RS_INVAL_WAL_REMOVED), even
> if the walsender advances the restart_lsn, the slot will not be considered in the
> ReplicationSlotsComputeRequiredXmin(), so the WALs and Rows are not safe
> and that's why I think it's a bug.
>
> After looking closer, it seems this behavior started from 15f8203 which introduced the
> ReplicationSlotInvalidationCause 'invalidated', after that we check the invalidated enum
> in Xmin/Lsn computation function.
>
> If we want to go back to previous behavior, we need to revert/adjust the check
> for invalidated in ReplicationSlotsComputeRequiredXmin(), but since the logical
> decoding(walsender) disallow using invalidated slot, so I feel it's consistent
> to do similar check for physical one. Besides, pg_replication_slot_advance()
> also disallow passing invalidated slot to it as well. What do you think ?

What happens if you run your script on a build prior to 15f8203?
Purely from reading the code, it looks like the physical slot would
sprint back to life since its restart_lsn would be updated. But I am
not able to see what happens to invalidated_at. It probably remains a
valid LSN and the slot would still not be considred in xmin
calculation. It will be good to be compatible to pre-15f8203
behaviour.

I think making logical and physical slot behaviour consistent would be
better but if the inconsistent behaviour is out there for some
releases, changing that now will break backward compatibility.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shlok Kyal 2023-12-08 13:46:47 Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Previous Message Ashutosh Bapat 2023-12-08 13:27:54 Re: Assert failure on 'list_member_ptr(rel->joininfo, restrictinfo)'