Re: persist logical slots to disk during shutdown checkpoint

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: persist logical slots to disk during shutdown checkpoint
Date: 2023-09-05 12:30:41
Message-ID: CAFiTN-uDHktB83QaVVn+DGw-FEj6-xkXwdcoDvWTm7hdZ=oipQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 5, 2023 at 5:04 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>

> > Can't we just have code like this? I mean we will have to make
> > ReplicationSlotMarkDirty take the slot as an argument, or have another
> > version which takes the slot as an argument and that would be called by
> > us as well as by ReplicationSlotMarkDirty(). I mean, why do we need
> > these checks (s->data.invalidated == RS_INVAL_NONE &&
> > s->data.confirmed_flush != s->last_saved_confirmed_flush) under the
> > mutex? The walsender is shut down, so the confirmed flush LSN can not
> > move concurrently, and the slot can not be invalidated either, because
> > that is done by the checkpointer and we are in the checkpointer?
>
> I agree with your analysis that the lock may be unnecessary for now and the
> code will work, but I personally feel we'd better take the spinlock.
>
> Firstly, considering our discussion on the potential extension of persisting
> the slot for online checkpoints in the future, we anyway need the lock at that
> time, so taking the lock here could avoid overlooking the need to update it
> later. And the lock also won't cause any performance or concurrency issue.

If we think we might want to persist on online checkpoints as well, then
it would be better to do it now. But this is not an extension of the
feature; rather, we have concluded that it is wise to persist only on the
shutdown checkpoint. I think that's the conclusion at this point, and if
that's the conclusion, there is no point in writing code on the
assumption that we will change our conclusion in the future.

> Additionally, if we don't take the lock, we rely on the assumption that the
> walsender will exit before the shutdown checkpoint. Currently, that's true for
> the logical walsender, but a physical walsender can exit later than the
> checkpointer. So, I am slightly worried that if we change the logical
> walsender's exit timing in the future, the assumption may not hold.
>
> Besides, for non-built-in logical replication, if someone creates their own
> walsender or other processes to send the changes and the process doesn't exit
> before the shutdown checkpoint, it may also be a problem. Although I don't have
> existing examples of such extensions, I feel taking the lock would make
> it more robust.

I think all our logic is based on the assumption that the walsender has
already exited. If it hasn't, then even if you check under the mutex that
the confirmed flush LSN has not changed, it can change right after you
release the lock, and then we would not be flushing the latest value of
the confirmed flush LSN to disk, and our logic of comparing
checkpoint.redo with the confirmed flush LSN might not work?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
