Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-11-20 05:48:21
Message-ID: CAA4eK1J9_vUpmO=+xWgwg=nb-ipcUuJhHNLkAOP-dvWVkDiV7Q@mail.gmail.com
Lists: pgsql-bugs

On Fri, Nov 20, 2020 at 10:59 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Nov 20, 2020 at 10:21 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Wed, Nov 18, 2020 at 2:12 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Nov 18, 2020 at 11:19 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, Nov 18, 2020 at 3:17 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > > To cut a long story short, a tablesync worker CAN in fact end up
> > > > > > processing (e.g. apply_dispatch) streaming messages.
> > > > > > So the tablesync worker CAN get into the apply_handle_stream_commit.
> > > > > > And this scenario, albeit rare, will crash.
> > > > > >
> > > > >
> > > > > Thank you for reproducing this issue. Dilip, Peter, is anyone of you
> > > > > interested in writing a fix for this?
> > > >
> > > > Hi Amit.
> > > >
> > > > FYI - Sorry, I am away/offline for the next 5 days.
> > > >
> > > > However, if this bug still remains unfixed after next Tuesday then I
> > > > can look at it then.
> > > >
> > >
> > > Fair enough. Let's see if Dilip or I can get a chance to look into
> > > this before that.
> > >
> > > > ---
> > > >
> > > > IIUC there are 2 options:
> > > > a) Disallow streaming for the tablesync worker.
> > > > b) Make streaming work for the tablesync worker.
> > > >
> > > > I prefer option (a) not only because of the KISS principle, but also
> > > > because this is how the tablesync worker was previously thought to
> > > > behave anyway. I expect this fix may be like the code that Dilip
> > > > already posted [1]
> > > > [1] https://www.postgresql.org/message-id/CAFiTN-uUgKpfdbwSGnn3db3mMQAeviOhQvGWE_pC9icZF7VDKg%40mail.gmail.com
> > > >
> > > > OTOH, option (b) fix may or may not be possible (I don't know), but I
> > > > have doubts that it is worthwhile to consider making a special fix for
> > > > a scenario which so far has never been reproduced outside of the
> > > > debugger.
> > > >
> > >
> > > I would prefer option (b) unless the fix is not possible due to design
> > > constraints. I don't think it is a good idea to make tablesync workers
> > > behave differently unless we have a reason for doing so.
> > >
> >
> > Okay, I will analyze this and try to submit my finding today.
>
> I have done my analysis. Basically, the table sync worker applies
> all the changes (for multiple transactions from upstream) under a
> single transaction (on downstream). For normal changes, we can
> just avoid committing in apply_handle_commit and everything is fine.
> For streaming changes, we also have a transaction at the stream level,
> which we need in order to manage the buffiles that store the streamed
> changes. So if we want to support streaming transactions, then
> we need to avoid committing the transaction in apply_handle_stream_commit
> as well as in apply_handle_stream_stop for the table sync worker.
>

And what about apply_handle_stream_abort? Also, I guess we need to
avoid other related things, like updating
replorigin_session_origin_lsn, replorigin_session_origin_timestamp,
etc., in apply_handle_stream_commit(), as we do in apply_handle_commit().
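
To make the idea concrete, here is a rough, self-contained sketch of the
pattern being discussed. It is a toy model and not the PostgreSQL source:
the Worker struct, its fields, and the toy_* functions are all hypothetical
names invented for illustration. It only shows both commit paths sharing one
tail that, for a tablesync worker, leaves the local transaction open and
leaves the origin LSN/timestamp untouched. It does not attempt to model
apply_handle_stream_stop or apply_handle_stream_abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a replication worker; NOT PostgreSQL's real state. */
typedef struct Worker
{
    bool     is_tablesync;  /* true for the initial table sync worker */
    bool     in_xact;       /* a local transaction is currently open */
    int      local_commits; /* number of local commits performed so far */
    uint64_t origin_lsn;    /* models replorigin_session_origin_lsn */
    int64_t  origin_ts;     /* models replorigin_session_origin_timestamp */
} Worker;

/* Open a local transaction unless one is already open. */
static void begin_if_needed(Worker *w)
{
    if (!w->in_xact)
        w->in_xact = true;
}

/*
 * Shared tail for both commit paths (hypothetical helper).
 * A tablesync worker applies everything under one local transaction,
 * so it neither commits nor advances the origin position here.
 */
static void finish_remote_commit(Worker *w, uint64_t end_lsn, int64_t commit_ts)
{
    assert(w->in_xact);

    if (w->is_tablesync)
        return;

    w->origin_lsn = end_lsn;
    w->origin_ts = commit_ts;
    w->in_xact = false;
    w->local_commits++;
}

/* Models the handling of a normal (non-streamed) remote commit. */
void toy_handle_commit(Worker *w, uint64_t end_lsn, int64_t commit_ts)
{
    begin_if_needed(w);
    finish_remote_commit(w, end_lsn, commit_ts);
}

/* Models the handling of a streamed remote commit. */
void toy_handle_stream_commit(Worker *w, uint64_t end_lsn, int64_t commit_ts)
{
    begin_if_needed(w);
    /* Replaying the spooled changes would happen here. */
    finish_remote_commit(w, end_lsn, commit_ts);
}
```

With this shape, a worker created with is_tablesync = true still has its
transaction open and zero local commits after any number of toy_handle_commit
or toy_handle_stream_commit calls, while a regular apply worker commits and
records the origin position each time.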

I think it is difficult to have a reliable test case for this but feel
free to propose if you have any ideas on the same.

--
With Regards,
Amit Kapila.
