Re: long-standing data loss bug in initial sync of logical replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2024-06-27 03:08:27
Message-ID: CAA4eK1+9_iyZRr0TJjLb9HK3uVzQr31UCq=obqDph5JtE=cjLA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 26, 2024 at 4:57 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 6/25/24 07:04, Amit Kapila wrote:
> > On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >>
> >> On 6/24/24 12:54, Amit Kapila wrote:
> >>> ...
> >>>>
> >>>>>> I'm not sure there are any cases where using SRE instead of AE would cause
> >>>>>> problems for logical decoding, but it seems very hard to prove. I'd be very
> >>>>>> surprised if just using SRE would not lead to corrupted cache contents in some
> >>>>>> situations. The cases where a lower lock level is ok are ones where we just
> >>>>>> don't care that the cache is coherent in that moment.
> >>>>
> >>>>> Are you saying it might break cases that are not corrupted now? How
> >>>>> could obtaining a stronger lock have such effect?
> >>>>
> >>>> No, I mean that I don't know if using SRE instead of AE would have negative
> >>>> consequences for logical decoding. I.e. whether, from a logical decoding POV,
> >>>> it'd suffice to increase the lock level to just SRE instead of AE.
> >>>>
> >>>> Since I don't see how it'd be correct otherwise, it's kind of a moot question.
> >>>>
> >>>
> >>> We lost track of this thread and the bug is still open. IIUC, the
> >>> conclusion is to use SRE in OpenTableList() to fix the reported issue.
> >>> Andres, Tomas, please let me know if my understanding is wrong,
> >>> otherwise, let's proceed and fix this issue.
> >>>
> >>
> >> It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
> >> don't think we 'lost track' of it, but it's true we haven't done much
> >> progress recently.
> >>
> >
> > Okay, thanks for pointing to the CF entry. Would you like to take care
> > of this? Are you seeing anything more than the simple fix to use SRE
> > in OpenTableList()?
> >
>
> I did not find a simpler fix than adding the SRE, and I think pretty
> much any other fix is guaranteed to be more complex. I don't remember
> all the details without relearning all the details, but IIRC the main
> challenge for me was to convince myself it's a sufficient and reliable
> fix (and not working simply by chance).
>
> I won't have time to look into this anytime soon, so feel free to take
> care of this and push the fix.
>

Okay, I'll take care of this.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nisha Moond 2024-06-27 03:14:02 Re: Conflict Detection and Resolution
Previous Message Melanie Plageman 2024-06-27 02:18:18 Re: Add LSN <-> time conversion functionality