RE: long-standing data loss bug in initial sync of logical replication

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Shlok Kyal' <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: RE: long-standing data loss bug in initial sync of logical replication
Date: 2024-09-26 11:53:07
Message-ID: TYAPR01MB569258257A869477A27F0BD0F56A2@TYAPR01MB5692.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Shlok,

> Hi,
>
> I tried to add changes to selectively invalidate the cache to reduce
> the performance degradation during the distribution of invalidations.

Thanks for improving the patch!

>...
>
> Solution:
> 1. When we alter a publication using commands like ‘ALTER PUBLICATION
> pub_name DROP TABLE table_name’, first all tables in the publications
> are invalidated using the function ‘rel_sync_cache_relation_cb’. Then
> again ‘rel_sync_cache_publication_cb’ function is called which
> invalidates all the tables.

On my environment, rel_sync_cache_publication_cb() was called first and invalidate
all the entries, then rel_sync_cache_relation_cb() was called and the specified
entry is invalidated - hence second is NO-OP.

> This happens because of the following
> callback registered:
> CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
> rel_sync_cache_publication_cb, (Datum) 0);

But even in this case, I could understand that you want to remove the
rel_sync_cache_publication_cb() callback.

> 2. When we add/drop a schema to/from a publication using command like
> ‘ALTER PUBLICATION pub_name ADD TABLES in SCHEMA schema_name’, first
> all tables in that schema are invalidated using
> ‘rel_sync_cache_relation_cb’ and then again
> ‘rel_sync_cache_publication_cb’ function is called which invalidates
> all the tables.

Even in this case, rel_sync_cache_publication_cb() was called first and then
rel_sync_cache_relation_cb().

>
> 3. When we alter a namespace using command like ‘ALTER SCHEMA
> schema_name RENAME to new_schema_name’ all the table in cache are
> invalidated as ‘rel_sync_cache_publication_cb’ is called due to the
> following registered callback:
> CacheRegisterSyscacheCallback(NAMESPACEOID,
> rel_sync_cache_publication_cb, (Datum) 0);
>
> So, we added a new callback function ‘rel_sync_cache_namespacerel_cb’
> will be called instead of function ‘rel_sync_cache_publication_cb’ ,
> which invalidates only the cache of the tables which are part of that
> particular namespace. For the new function the ‘namespace id’ is added
> in the Invalidation message.

Hmm, I feel this fix is too much. Unlike ALTER PUBLICATION statements, I think
ALTER SCHEMA is rarely executed at the production stage. However, this approach
requires adding a new cache callback system, which affects the entire postgres
system; this is not very beneficial compared to the outcome. It should be discussed
on another thread to involve more people, and then we can add the improvement
after being accepted.

> Performance Comparison:
> I have run the same tests as shared in [1] and observed a significant
> decrease in the degradation with the new changes. With selective
> invalidation degradation is around ~5%. This results are an average of
> 3 runs.

IIUC, the executed workload did not contain ALTER SCHEMA command, so
third improvement did not contribute this improvement.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-09-26 12:19:46 Re: not null constraints, again
Previous Message Amit Kapila 2024-09-26 10:58:54 Re: Documentation to upgrade logical replication cluster