Re: long-standing data loss bug in initial sync of logical replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2023-11-18 20:45:35
Message-ID: 3c5ad91c-330a-0c5a-974a-da3801ffc18b@enterprisedb.com
Lists: pgsql-hackers

On 11/18/23 19:12, Andres Freund wrote:
> Hi,
>
> On 2023-11-18 11:56:47 +0100, Tomas Vondra wrote:
>>> I guess it's not really feasible to just increase the lock level here though
>>> :(. The use of ShareUpdateExclusiveLock isn't new, and suddenly using AEL
>>> would perhaps lead to new deadlocks and such? But it also seems quite wrong.
>>>
>>
>> If this really is about the lock being too weak, then I don't see why
>> would it be wrong?
>
> Sorry, that was badly formulated. The wrong bit is the use of
> ShareUpdateExclusiveLock.
>

Ah, you meant the current lock mode seems wrong, not that changing the
locks seems wrong. Yeah, true.

>
>> If it's required for correctness, it's not really wrong, IMO. Sure, stronger
>> locks are not great ...
>>
>> I'm not sure about the risk of deadlocks. If you do
>>
>> ALTER PUBLICATION ... ADD TABLE
>>
>> it's not holding many other locks. It essentially gets just a lock on
>> the pg_publication catalog, and then on the publication row. That's it.
>>
>> If we increase the locks from ShareUpdateExclusive to ShareRowExclusive,
>> we're making it conflict with RowExclusive. Which is just DML, and I
>> think we need to do that.
>
> From what I can tell it needs to be an AccessExclusiveLock. Completely
> independent of logical decoding. The way the cache stays coherent is catalog
> modifications conflicting with anything that builds cache entries. We have a
> few cases where we do use lower level locks, but for those we have explicit
> analysis for why that's ok (see e.g. reloptions.c) or we block until nobody
> could have an old view of the catalog (various CONCURRENTLY operations).
>

Yeah, I got too focused on the issue I triggered, which seems to be
fixed by using SRE (still don't understand why ...). But you're probably
right that there may be other cases where SRE would not be sufficient; I
certainly can't prove it'd be safe.

>
>> So maybe that's fine? For me, a detected deadlock is better than
>> silently missing some of the data.
>
> That certainly is true.
>
>
>>> We could brute force this in the logical decoding infrastructure, by
>>> distributing invalidations from catalog modifying transactions to all
>>> concurrent in-progress transactions (like already done for historic catalog
>>> snapshot, c.f. SnapBuildDistributeNewCatalogSnapshot()). But I think that'd
>>> be a fairly significant increase in overhead.
>>>
>>
>> I have no idea what the overhead would be - perhaps not too bad,
>> considering catalog changes are not too common (I'm sure there are
>> extreme cases). And maybe we could even restrict this only to
>> "interesting" catalogs, or something like that? (However, I hate those
>> weird differences in behavior; they can easily lead to bugs.)
>>
>> But it feels more like a band-aid than actually fixing the issue.
>
> Agreed.
>

... and it would not fix the other places outside logical decoding.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
