Re: long-standing data loss bug in initial sync of logical replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2024-06-26 11:27:17
Message-ID: c1e5ccd0-9681-4959-8c8a-ad4853064e98@enterprisedb.com
Lists: pgsql-hackers

On 6/25/24 07:04, Amit Kapila wrote:
> On Mon, Jun 24, 2024 at 8:06 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> On 6/24/24 12:54, Amit Kapila wrote:
>>> ...
>>>>
>>>>>> I'm not sure there are any cases where using SRE instead of AE would cause
>>>>>> problems for logical decoding, but it seems very hard to prove. I'd be very
>>>>>> surprised if just using SRE would not lead to corrupted cache contents in some
>>>>>> situations. The cases where a lower lock level is ok are ones where we just
>>>>>> don't care that the cache is coherent in that moment.
>>>>
>>>>> Are you saying it might break cases that are not corrupted now? How
>>>>> could obtaining a stronger lock have such effect?
>>>>
>>>> No, I mean that I don't know if using SRE instead of AE would have negative
>>>> consequences for logical decoding. I.e. whether, from a logical decoding POV,
>>>> it'd suffice to increase the lock level to just SRE instead of AE.
>>>>
>>>> Since I don't see how it'd be correct otherwise, it's kind of a moot question.
>>>>
>>>
>>> We lost track of this thread and the bug is still open. IIUC, the
>>> conclusion is to use SRE in OpenTableList() to fix the reported issue.
>>> Andres, Tomas, please let me know if my understanding is wrong,
>>> otherwise, let's proceed and fix this issue.
>>>
>>
>> It's in the commitfest [https://commitfest.postgresql.org/48/4766/] so I
>> don't think we 'lost track' of it, but it's true we haven't made much
>> progress recently.
>>
>
> Okay, thanks for pointing to the CF entry. Would you like to take care
> of this? Are you seeing anything more than the simple fix to use SRE
> in OpenTableList()?
>

I did not find a simpler fix than adding the SRE, and I think pretty
much any other fix is guaranteed to be more complex. I don't remember
all the details and would have to relearn them, but IIRC the main
challenge for me was to convince myself it's a sufficient and reliable
fix (and not working simply by chance).

I won't have time to look into this anytime soon, so feel free to take
care of this and push the fix.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
